Preface


The journey of this book began at the end of 2016 when preparing material for a statistics course 
for The University of Queensland. At the time, the Julia language was already showing itself as a 
powerful new and applicable tool, even though it was only at version 0.5. For this reason, we chose 
Julia for use in the course. Exposing students to statistics with Julia early on would enable them
to employ Julia for data science, numerical computation and machine learning tasks later in
their careers. This choice was not without some resistance from students and colleagues, since back
then, as is still the case in 2020, the R language dominates the world of statistics in terms of volume,
in the same way that Python dominates the world of machine learning. So why Julia?


There were three main reasons: performance, simplicity and flexibility. Julia is quickly becoming 
a major contending language in the world of data science, statistics, machine learning, artificial 
intelligence, and general scientific computing. It is easy to use like R, Python, and Matlab, but due 
to its type system and just-in-time compilation, it performs computations much more efficiently. 
This enables it to be fast, not just in terms of run time, but also in terms of development time. In 
addition, there are many different Julia packages. These include advanced methods for the data
scientist, statistician, or machine learning practitioner. Hence the language has a broad scope of
application.


Our goal in writing this book was to create a resource for understanding the fundamental 
concepts of statistics needed for mastering machine learning, data science and artificial intelligence. 
A further aim is to introduce the reader to Julia by using it as a computational tool.
The book also aims to serve as a reference for the data scientist, machine learning practitioner, 
bio-statistician, finance professional, or engineer, who has either studied statistics before, or wishes 
to fill gaps in their understanding. In today’s world, such students, professionals, or researchers 
often use advanced methods and techniques. However, one is often required to take a step back and 
explore or revisit fundamental concepts. Revisiting these concepts with the aid of a programming 
language such as Julia immediately makes the concepts concrete. 


Now, 4 years since we embarked on this book writing journey, Julia has matured beyond v1.0, 
and the book has matured along with it. Julia can be easily deployed by anyone who wishes to 
use it. However, currently many of Julia’s users are hard-core developers that contribute to the 
language’s standard libraries, and to the extensive package eco-system that surrounds it. Therefore, 
much of the Julia material available at present is aimed at other developers rather than end users. 
This is where our book comes in, as it has been written with the end-user in mind. 


This book is about statistics, probability, data science, machine learning and artificial
intelligence. By reading it you should be able to gain a basic understanding of the concepts that underpin
these fields. However, in contrast to books that focus on theory, this book is code example centric.
Almost all of the concepts that we introduce are backed by illustrative code examples. Similarly,
almost all of the figures are generated via the code examples. The code examples have been deliberately
written in a simple format, sometimes at the expense of efficiency and generality, but with the
advantage of being easily readable. Each of the code examples aims to convey a specific statistical
point, while covering Julia programming concepts in parallel. The code examples are reminiscent
of examples that a lecturer may use in a lecture to illustrate concepts. The content of the book is
written in a manner that does not assume any prior statistical knowledge, and in fact only assumes
some basic programming experience and a basic understanding of mathematical notation.




As you read this book, you can also run the code examples yourself. You may experiment by 
modifying parameters in the code examples or making any other modification that you can think 
of. With the exception of a few introductory examples, most of the code examples rarely focus on 
the Julia language directly but are rather meant to illustrate statistical concepts. They are then 
followed by a brief description dealing with specific Julia language issues. Nevertheless, if learning 
Julia is your focus, by using and experimenting with the examples you can learn the basics of Julia 
as well. The code examples can be downloaded from the book's GitHub repository:


https://github.com/h-Klok/StatsWithJuliaBook 


Further, an erratum, and an electronic version of Appendix A's how-to guide, can be found on
the book's website:


https://statisticswithjulia.org/





The book contains a total of 10 chapters. The content may be read continuously, or accessed in an 
ad-hoc manner. The structure of the individual chapters is as follows: 


Chapter 1 is an introduction to Julia, including its setup, package manager, and a list of the main
packages used in the book. The reader is introduced to some basic Julia syntax, and programmatic 
structure through code examples that aim to illustrate some of the language's basic features. As it 
is central to the book, basics of random number generation are also introduced. Further, examples 
dealing with integration with other languages including R and Python are presented. 


Chapter 2 explores basic probability, with a focus on events, outcomes, independence, and
conditional probability concepts. Several typical probability examples are presented, along with
exploratory simulation code.


Chapter 3 explores random variables and probability distributions, with a focus on the use of
Julia's Distributions package. Discrete, continuous, univariate, and multivariate probability
distributions are introduced and explored as an insightful and pedagogical task. This is done
through both simulation and explicit analysis, along with the graphing of associated functions of
distributions, such as the PMF, PDF, CDF, and quantiles.


Chapter 4 momentarily departs from probabilistic notions to focus on data processing, data
summary and data visualizations. The concept of the DataFrame is introduced as a mechanism for
storing heterogeneous data types with the possibility of missing values. Data frames are an integral
component of data science and statistics in Julia, just as they are in R and Python. A summary of
classic descriptive statistics and their application in Julia is also introduced. This is augmented by
the inclusion of concepts such as Kernel Density Estimation and the empirical cumulative
distribution function. The chapter closes with some basic functionality for working with files.


Chapter 5 introduces general statistical inference ideas. The sampling distributions of the sample
mean and sample variance are presented through simulation and analytic examples, illustrating
the central limit theorem and related results. Then general concepts of statistical estimation are
explored, including basic examples of the method of moments and maximum likelihood estimation,
followed by simple confidence bounds. Basic notions of statistical hypothesis testing are introduced,
and finally the chapter closes by touching on basic ideas of Bayesian statistics.


Chapter 6 covers a variety of practical confidence intervals for both one and two samples. The
chapter starts with standard confidence intervals for means, and then progresses to the more modern
bootstrap method and prediction intervals. The chapter also serves as an entry point for
investigating the effects of model assumptions on inference.


Chapter 7 focuses on hypothesis testing. The chapter begins with standard t-tests for population
means, and then covers hypothesis tests for the comparison of two means. Then, Analysis of 
Variance (ANOVA) is covered, along with hypothesis tests for checking independence and goodness 
of fit. The reader is then introduced to power curves. 


Chapter 8 covers least squares, statistical linear regression models, generalized models, and a touch
of time series. It begins by covering least squares and then moves on to the linear regression statistical
model, including hypothesis tests and confidence bands. Additional concepts of regression are also 
explored. These include assumption checking, model selection, interactions and more. Generalized 
linear models are introduced and an introduction to time series analysis is also presented. 


Chapter 9 provides a broad overview of machine learning concepts. The concepts presented include
supervised learning, unsupervised learning, reinforcement learning, and generative adversarial
networks. In a quick tour of methods for dealing with such problems, examples illustrate multiple
machine learning methods including random forests, support vector machines, clustering, principal
component analysis, deep learning, and more.


Chapter 10 moves on to dynamic stochastic models in applied probability, giving the reader an
indication of the strength of stochastic modeling and Monte-Carlo simulation. It focuses on dynamic 
systems, where Markov chains, discrete event simulation, and reliability analysis are explored, along 
with several aspects dealing with random number generation. It also includes examples of basic 
epidemic modeling. 


In addition to the core material, the book also contains 3 appendices. Appendix A is particularly
useful as it can serve as a basic “how to guide” for Julia, calling upon the many examples in the
chapters for reference.


Appendix A contains a list of many useful items detailing “how to perform ... in Julia”, where
the reader is directed to specific code examples that deal directly with these items. 


Appendix B lists additional language features of the Julia language that were not used by the
code examples in this book. 


Appendix C lists additional Julia packages related to statistics, machine learning, data science,
and artificial intelligence that were not used in this book. 


Whether you are an industry professional, a student, an educator, a researcher, or an enthusiast, 
we hope that you find this book useful. Use it to expand your knowledge in fundamentals of statistics 
with a view towards machine learning, artificial intelligence, and data science. We further hope that 
the integration of Julia code and the content that we present help you quickly apply Julia for such 
purposes. 


We would like to thank many colleagues, family members and friends for their feedback,
comments and suggestions. These include Julianna Forbes, Milan Bouchet-Valat, Heidi Dixon, Jaco Du
Plessis, Vaughan Evans, Liam Hodgkinson, Bogumił Kamiński, Dirk Kroese, Benoit Liquet, Ruth
Luscombe, Geoff McLachlan, Moshe Nazarathy, Robert Salomone, Vincent Tam, Sérgio Bacelar,
Alex Stenlake, James Tanton, and others. In particular, we thank Vektor Dewanto for detailed
feedback, and for catching dozens of typos and errors. We also thank Joe Grotowski and Matt
Davis from The University of Queensland for additional help dealing with the publishing process.
Yoni Nazarathy would also like to acknowledge the Australian Research Council (ARC) for
supporting part of this work via Discovery Project grant DP180101602.


Yoni Nazarathy and Hayden Klok. 
Chapter 1


Introducing Julia - DRAFT 


Programming goes hand in hand with mathematics, statistics, data science and many other 
fields. Scientists, engineers, data scientists and statisticians often need to automate computation 
that would otherwise take too long or be infeasible to carry out. This is for the purpose of prediction, 
planning, analysis, design, control, visualization, or as an aid for theoretical research. Often, general 
programming languages such as Fortran, C/C++, Java, Swift, C#, Go, JavaScript, or Python are 
used. In other cases, more mathematical/statistical programming languages such as Mathematica,
Matlab/Octave, R, or Maple are employed. The process typically involves analyzing the problem 
at hand, writing code, analyzing behavior and output, re-factoring, iterating and improving the 
model. At the end of the day, a critical component is speed, specifically, the speed it takes to reach 
a solution - whatever it may be. 


When trying to quantify speed, the answer is not simple. On the one hand, speed can be 
quantified in terms of how fast a piece of computer code runs, namely runtime speed. On the other 
hand, speed can be quantified in terms of how fast it takes to code, debug and re-factor computer 
code, namely development speed. Within the realm of scientific computing and statistical computing, 
compiled low-level languages such as Fortran or C/C++ generally yield fast runtime performance, 
but require more care in the creation of the code. Hence they are generally fast in terms of runtime,
yet slow in terms of development time. On the opposite side of the spectrum are mathematically 
specialized languages such as Mathematica, R, Matlab, as well as Python. These typically allow 
for more flexibility when creating code, hence generally yield quicker development times. However, 
runtimes are typically significantly slower than what can be achieved with a low-level language. 
In fact, many of the efficient statistical and scientific computing packages incorporated in these 
languages are written in low-level languages, such as Fortran or C/C++, which allow for faster 
runtimes when applied as closed modules. 


A practitioner wanting to use a computer for statistical and mathematical analysis often faces 
a trade-off between runtime and development time. While speed (both development and runtime) 
is hard to fully and fairly quantify, Figure 1.1 illustrates a schematic view showing general speed
trade-offs between languages. As is postulated in this figure, there is a type of a Pareto optimal 
frontier ranging from the C language on one end to the R language on the other. The location of 
each language on this figure cannot be determined exactly. However, few would disagree that “R is 
generally faster to code than C” and “C generally runs faster than R”. So, what about Julia? 
Figure 1.1: A schematic of run speed vs. development speed.
Observe the Pareto-optimal frontier existing prior to Julia. 


The Julia language and framework developed in the last decade makes use of a variety of advances 
in compilation, computer languages, scientific computation and performance optimization. It is a 
language designed with a view of improving on the previous Pareto-optimal frontier depicted in 
Figure 1.1. With syntax, style, and feel somewhat similar to R, Python and Matlab/Octave, and
with performance comparable to that of C/C++ and Fortran, Julia attempts to break the so-called
two-language problem. That is, it is postulated that practitioners may quickly create code in Julia,
which also runs quickly. Further, re-factoring, improving, iterating and optimizing code can be done
in Julia, and does not require the code to be ported to C/C++ or Fortran. In contrast to Python,
R and other high level languages, the Julia standard libraries, and almost all of the Julia code base,
are written in Julia.


Following this discussion about development speed and runtime speed, we make a rather sharp 
turn. We focus on learning speed. In this context, we focus on learning how to use Julia and 
in the same process learning and/or strengthening knowledge of statistics, machine learning, data 
science, and artificial intelligence. In this respect, with the exception of some minor discussions 
in Section 1.1, "runtime speed and performance" is seldom mentioned in the book. It is rather
axiomatically obtained by using Julia. Similarly, coding and complex project development speed is 
not our focus. Again, the fact that Julia feels like a high-level language, very similar to Python, 
immediately suggests it is practical to code complex projects quickly in the language. Our focus is 
on learning quickly. 


By following the code examples in this book (there are over 200), we allow you to learn how 
to use the basics of Julia quickly and efficiently. In the same go, we believe that this book will 
strengthen or build your understanding of probability, statistics, machine learning, data science, 
and artificial intelligence. In fact, the book contains a self contained overview of these fields, taking 
the reader through a tour of many concepts, illustrated via Julia code examples. Even if you are a 
seasoned statistician, data-scientist, machine learner, or probabilist, we are confident that you will 
find some of our discussions and examples interesting and gain further insight. 


Question: Do I need to have any statistics or probability knowledge to read this book? 

Answer: Statistics or probability knowledge is not pre-assumed. Hence, this book is a self-contained 
guide for the core principles of probability, statistics, machine learning, data science, and artificial 
intelligence. It is ideally suited for engineers, data-scientists, or science professionals, wishing to 
strengthen their core probability, statistics, and data science knowledge while exploring the Julia 
language. However, general mathematical notation and results including basics from linear algebra, 
calculus, and discrete mathematics are used. 


Question: What experience in programming is needed in order to use this book?

Answer: While this book is not an introductory programming book, it does not assume that the
reader is a professional software developer. Any reader who has coded in some other language at a
basic level will be able to follow the code examples and their descriptions.


Question: How to read the book? 

Answer: You may either read the book sequentially, or explore ideas and code examples in an
ad-hoc manner. This book is code example centric. The code examples are the backbone of the story
with each example illustrating a statistical concept together with the text, figures, and formulas
that surround it. In any case, feel free to use the code-repository on GitHub:


https://github.com/h-Klok/StatsWithJuliaBook 


As you do so, you can try to modify the code in the examples to experiment with various aspects 
of the statistical phenomena being presented. You may often modify numerical parameters and 
see what effect your modification has on the output. For ad-hoc Julia help, you may also use 
Appendix A, “How-to in Julia”. It directs you to individual code listings that contain specific
examples of “how to”. It is also searchable online.


Question: What are the unifying features of the code examples? 

Answer: With the exception of a few examples focusing on Julia basics, most code examples in this 
book are meant to illustrate statistical concepts. Each example is designed to run autonomously 
and to fit on a single page. Hence the code examples are often not optimized for efficiency and 
modularity. Instead, the goal is always to “get the job done” in the clearest, cleanest, and simplest 
way possible. With the aid of the code, you will pick up Julia syntax, structure, and package 
usage. However, you should not treat the code as ideal scientific programming code but rather as 
illustrative code for presenting and exploring basic concepts. 


The remainder of this chapter is structured as follows: In Section 1.1 we present a brief overview
of the Julia language. In Section 1.2 we describe some options for setting up a Julia working
environment presenting the REPL and JuliaBox. Then in Section 1.3 we dive into Julia code
examples designed to highlight basic powerful language features. We continue in Section 1.4 where
we present code examples for plotting and graphics. Then in Section 1.5 we overview random number
generation and the Monte Carlo method, used throughout the book. We close with Section 1.6 where
we illustrate how other languages such as Python, R, and C can be easily integrated with your Julia
code. If you are a newcomer to statistics, then it is possible that some of the examples covered in the 
first chapter are based on ideas that you have not previously touched. The purpose of the examples 
is to illustrate key aspects of the Julia language in this context. Hence, if you find the examples of 
the first chapter overwhelming, feel free to advance to the next chapter where elementary probability 
is introduced starting with basic principles. The content then builds up from there gradually. 
1.1 Language Overview


We now embark on a very quick tour of Julia. We start by overviewing language features in broad 
terms and continue with several basic code examples. This section is in no way a comprehensive 
description of the programming language and its features. Rather, it aims to overview a few select 
language features and introduce minimal basics. 


About Julia 


Julia is first and foremost a scientific programming language. It is perfectly suited for statistics, 
machine learning, data science, as well as for light and heavy numerical computational tasks. It can 
also be integrated in user-level applications, however one would not typically use it for front-end 
interfaces, or game creation. It is an open-source language and platform, and the Julia community 
brings together contributors from the scientific computing, statistics, and data-science worlds. This 
puts the Julia language and package system in a good place for combining mainstream statistical 
methods with methods and trends of the scientific computing world. Coupled with programmatic 
simplicity similar to Python, and with speed similar to C, Julia is taking an active part in the
data-science revolution. In fact, some believe it may overtake Python and R to become the primary
language of data-science in the future. Visit https://julialang.org/ for more details.


We now discuss a few of the language’s main features. If you are relatively new to programming,
you may want to skip this discussion, and move to the subsection below which deals with a few basic
commands. A key distinction between Julia and other high-level scientific computing languages is
that Julia is strongly typed. This means that every variable or object has a distinct type that can
either explicitly or implicitly be defined by the programmer. This allows the Julia system to work
efficiently and integrates well with Julia’s just-in-time (JIT) compiler. However, in contrast to low
level strongly-typed languages, Julia relieves the user of having to be “type-aware” whenever
possible. In fact, many of the code examples in this book do not explicitly specify types. That is,
Julia features optional typing, and when coupled with Julia’s multiple dispatch and type inference,
Julia’s JIT compilation system creates fast running code (compiled via LLVM), that is also very easy
to program and understand.
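As a minimal sketch of optional typing, consider the snippet below. The function names double and
doubleFloat are hypothetical and serve only as an illustration. The first function carries no type
annotations and works for several argument types, while the second is restricted to Float64.

```julia
# Every value has a type, whether or not it is written down explicitly
x = 3            # inferred to be an integer
y = 3.0          # inferred to be a Float64
println(typeof(x), "  ", typeof(y))

# No annotations: Julia compiles a specialized version for each argument type
double(a) = 2a

# Optional annotation: this method only accepts Float64 arguments
doubleFloat(a::Float64) = 2a

println(double(x), "  ", double(y), "  ", doubleFloat(y))
```

Calling doubleFloat(3) with an integer argument raises a MethodError, since no method of that
function matches an integer.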


The core Julia language imposes very little, and in fact the standard Julia libraries, and almost
all of Julia Base, are written in Julia itself. Even primitive operations such as integer arithmetic are
written in Julia. The language features a variety of additional packages, some of which are used in 
this book. All of these packages, including the language and system itself, are free and open source 
(MIT licensed). There are dozens of features of the language that can be mentioned. While it is 
possible, there is no need to vectorize code for performance. There is efficient support for Unicode, 
including but not limited to UTF-8. C can be called directly from Julia. There are even Lisp-like 
macros, and other metaprogramming facilities. 
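A few of these features can be sketched briefly. The names below (sumOfSquares, data, ex) are
hypothetical and chosen only for illustration. The snippet shows Unicode identifiers, a plain loop
that gives the same result as its vectorized counterpart, and a tiny taste of metaprogramming, where
code is held as a data structure before being evaluated.

```julia
# Unicode identifiers are allowed (in the REPL, type \mu and then press Tab)
μ = 2.5
σ² = 4.0
println(μ, "  ", σ²)

# A plain loop: there is no need to vectorize this for it to run efficiently
function sumOfSquares(v)
    total = 0.0
    for x in v
        total += x^2
    end
    return total
end

data = [1.0, 2.0, 3.0]
println(sumOfSquares(data))     # loop version
println(sum(data .^ 2))         # vectorized version, same value

# Metaprogramming: an expression is an ordinary object that can be evaluated later
ex = :(1 + 2)
println(typeof(ex), "  ", eval(ex))
```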


Julia development started in 2009 by Jeff Bezanson, Stefan Karpinski, Viral Shah, and Alan 
Edelman. The language was launched in 2012 and has grown significantly since then, with the 
current version 1.4 as of the middle of 2020. While the language and implementation are open source, 
the commercial company Julia Computing provides services and support for schools, universities, 
business, and enterprises that wish to use Julia. 
A Few Basic Commands


Julia is a complete programming language supporting various programming paradigms including
procedural programming, object-oriented programming, meta-programming and functional
programming. It is useful for numerical computations, data processing, visualization, parallel computing,
network input and output, and much more.


As with any programming language you need to start somewhere. We start with an extended 
“Hello world”. Look at the code listing below, and the output that follows. If you’ve programmed 
previously, you can probably figure out what each line of code does. We’ve also added a few comments
to this code example, using #. Read the code below, and look at the output that follows: 


Listing 1.1: Hello world and perfect squares


println("There is more than one way to say hello:")

# This is an array consisting of three strings
helloArray = ["Hello","G'day","Shalom"]

for i in 1:3
    println("\t", helloArray[i], " World!")
end

println("\nThese squares are just perfect:")

# This construct is called a 'comprehension' (or 'list comprehension')
squares = [i^2 for i in 0:10]

# You can loop on elements of arrays without having to use indexing
for s in squares
    print(" ",s)
end

# The last line of every code snippet is also evaluated as output (in addition to
# any figures and printing output generated previously).
sqrt.(squares)








There is more than one way to say hello:
    Hello World!
    G'day World!
    Shalom World!

These squares are just perfect:
 0 1 4 9 16 25 36 49 64 81 100
11-element Array{Float64,1}:
  0.0
  1.0
  2.0
  3.0
  4.0
  5.0
  6.0
  7.0
  8.0
  9.0
 10.0




Most of the book contains code listings such as Listing 1.1 above. For brevity of future code
examples, we generally omit comments. Instead most listings are followed by minor comments as 
seen below. 





The println() function is used for strings such as "There is...hello:". In line 4 we define
an array consisting of 3 strings. The for loop in lines 6-8 executes three times, with the variable i
incremented on each iteration. Line 7 is the body of the loop where println() is used to print
several arguments. The first, "\t", is a tab spacing. The second is the i-th entry of helloArray
(in Julia array indexing begins with index 1), and the third is an additional string. In line 10 the
"\n" character is used within the string to signify printing a new line. In line 13, a comprehension
is defined. It consists of the elements $\{i^2 : i \in \{0,\ldots,10\}\}$. We cover comprehensions further in
Listing 1.2. Lines 16-18 illustrate that loops may be performed on all elements of an array. In this
case, the loop changes the value of the variable s to another value of the array squares in each
iteration. Note the use of the print() function to print without a newline. Line 22, the last line of
the code block, applies the sqrt() function on each element of the array squares by using the ‘.’
broadcast operator. The expression of the last line of every code block, unless terminated by a “;”, is
presented as output. In this case, it is an 11-element array of the numbers 0,...,10. The type of
the output expression is also presented. It is Array{Float64,1}.
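The ‘.’ broadcast operator is not specific to sqrt(). As a small sketch, the snippet below applies
it to an arithmetic operator and to a user-defined function. The helper name addOne is hypothetical
and is used only for this illustration.

```julia
squares = [i^2 for i in 0:10]

# The dot applies sqrt() to each element and returns a new array
roots = sqrt.(squares)

# Broadcasting also works with operators and with user-defined functions
shifted = squares .+ 1
addOne(x) = x + 1              # hypothetical helper
println(addOne.(squares) == shifted)

# An equivalent way to obtain roots, using a comprehension
println(roots == [sqrt(s) for s in squares])
```

Both comparisons print true. In Julia 1.x, calling sqrt(squares) without the dot results in an
error, since sqrt() is not defined for a one dimensional array.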





When exploring statistics and other forms of numerical computation, it is often useful to use a
comprehension as a basic programming construct. As explained above, a typical form of a
comprehension is:


[f(x) for x in A]


Here, A is some array, or more generally a collection of objects. Such a comprehension creates an
array of elements, where each element x of A is transformed via f(x). Comprehensions are
ubiquitous in the code examples we present in this book. We often use them due to their expressiveness
and simplicity. We now present a simple additional example:





Listing 1.2: Using a comprehension


array1 = [(2n+1)^2 for n in 1:5]
array2 = [sqrt(i) for i in array1]
println(typeof(1:5), " ", typeof(array1), " ", typeof(array2))
1:5, array1, array2





UnitRange{Int64} Array{Int64,1} Array{Float64,1}
(1:5, [9, 25, 49, 81, 121], [3.0, 5.0, 7.0, 9.0, 11.0])
Figure 1.2: Visit https://docs.julialang.org for official language documentation.





The array array1 is created in line 1 with the elements $\{(2n+1)^2 : n \in \{1,\ldots,5\}\}$, in order. Note
that while mathematical sets are not ordered, comprehensions generate ordered arrays. Observe the
literal 2 in the multiplication 2n, without explicit use of the * symbol. In the next line, array2 is
created. An alternative would be to use sqrt.(array1). In line 3, we print the typeof() of three
expressions. The type of 1:5 (used to create array1) is a UnitRange of Int64. It is a special type
of object that encodes the integers 1,...,5 without explicitly allocating memory. Then the types of
both array1 and array2 are Array types, and they contain values of types Int64 and Float64
respectively. In line 4, a tuple of values is created through the use of a comma between 1:5, array1
and array2. As it is the last line of the code, it is printed as output. Observe that in the output, the
values of the second element of the tuple are printed as integers (no decimal point) while the values
of the third element are printed as floating point numbers.
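To make the distinction between a range and an array more concrete, here is a short sketch. The
variable names r, v and oddSquares are hypothetical. It also shows a comprehension with a filtering
condition and the juxtaposition of a numeric literal with a variable.

```julia
r = 1:5
println(typeof(r), "  ", length(r), "  ", sum(r))

# collect() materializes the range as an ordinary array in memory
v = collect(r)
println(v)

# A comprehension with a filter keeps only the elements that pass the test
oddSquares = [n^2 for n in 1:10 if isodd(n)]
println(oddSquares)

# Juxtaposing a numeric literal and a variable means multiplication
n = 4
println(2n + 1 == 2*n + 1)
```

The range r stores only its endpoints, whereas v holds all five integers explicitly.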





Getting Help 
You may consult the official Julia documentation, https://docs.julialang.org/, for help.
The documentation strikes a balance between precision and readability. See Figure 1.2.


While using Julia, help may be obtained through the use of ?. For example, try ?sqrt and you
will see output similar to Figure 1.3.
In [1]: ?sqrt

search: sqrt sqrtm isqrt

Out[1]: sqrt(x)

Return √x. Throws DomainError for negative Real arguments. Use complex negative arguments instead. The prefix operator √ is equivalent to sqrt.


Figure 1.3: Snapshot from a Julia Jupyter notebook: Keying in ?sqrt
presents help for the sqrt() function.


You may also find it useful to apply the methods() function. Try methods(sqrt). You will
see output that contains lines of this sort:

sqrt(x::Float32) at math.jl:426
sqrt(x::Float64) at math.jl:425
sqrt(z::Complex{#s45} where #s45<:AbstractFloat) at complex.jl:392
sqrt(z::Complex) at complex.jl:416
sqrt(x::Real) at math.jl:434
sqrt{T<:Number}(x::AbstractArray{T,N} where N) at deprecated.jl:56

This lists the different method implementations of the function sqrt(). In Julia, a given
function may be implemented in different ways depending on the types of its input arguments, with each
implementation being a method. Selecting the method based on the argument types is called multiple
dispatch. Here, the various methods of sqrt() are shown for different types of input arguments.
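As a minimal sketch of the same idea with a function of our own, consider the following. The function name describeValue is hypothetical and is used only for illustration.

```julia
# One function with four methods; Julia picks the most specific match
describeValue(x::Integer)       = "an integer"
describeValue(x::AbstractFloat) = "a floating point number"
describeValue(x::Number)        = "some other number"
describeValue(x)                = "not a number"

println(describeValue(3))       # uses the Integer method
println(describeValue(3.0))     # uses the AbstractFloat method
println(describeValue(1//2))    # a Rational falls back to the Number method
println(describeValue("three")) # falls back to the untyped method

println(length(methods(describeValue)), " methods defined")
```

Calling methods(describeValue) lists these four implementations in the same way that methods(sqrt) lists those of sqrt().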


Runtime Speed and Performance 


While Julia is fast and efficient, for most of this book we don't explicitly focus on runtime speed 
and performance. Rather, our aim is to help the reader learn how to use Julia while enhancing 
knowledge of probability and statistics. Nevertheless, we now briefly discuss runtime speed and 
performance. 


From a user perspective, Julia feels like an interpreted language as opposed to a compiled
language. With Julia, you are not required to explicitly compile your code before it is run. However, as
you use Julia, behind the scenes, the system's JIT (just-in-time) compiler compiles every new function
and code snippet as it is needed. This often means that the first execution of a function is much
slower than the second or subsequent runs. From a user perspective, this is apparent when using
other packages (as the example in Listing 1.3 below illustrates, this is often done by the using
command). On a first call (during a session) to the using command of a given package, you may
sometimes wait a few seconds for the package to compile. However, afterwards, no such wait is
needed.
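The effect of compilation on a first call can be observed with a small experiment. This is only a sketch: the function name sumOfRoots is ours, and the measured times depend on your machine and Julia version.

```julia
# A small function that has not been called (and hence not compiled) yet
sumOfRoots(n) = sum(sqrt(i) for i in 1:n)

firstCall  = @elapsed sumOfRoots(10^6)   # includes JIT compilation time
secondCall = @elapsed sumOfRoots(10^6)   # reuses the already compiled method

println("First call:  ", firstCall, " seconds")
println("Second call: ", secondCall, " seconds")
```

Typically the first measurement is noticeably larger than the second, although the exact ratio varies.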


For day to day statistics and scientific computing needs, you often don't need to give much
thought to performance and run speed with Julia, since Julia is inherently fast. For instance, as
we do in dozens of examples in this book, simple Monte Carlo simulations involving $10^6$ random
variables typically run in less than a second, and are very easy to code. However, as you progress
into more complicated projects, many repetitions of the same code block may merit profiling and
optimization of the code in question. Hence, you may wish to carry out basic profiling.


For basic profiling of performance the @time macro is useful. Wrapping code blocks with it
(via begin and end) causes Julia to profile the performance of the block. In Listings 1.3 and 1.4
we carry out such profiling. In both listings, we populate an array, called data, containing $10^6$
values, where each value is a mean of 500 random numbers. Hence, both listings handle half a
billion numbers. However, Listing 1.3 is a much slower implementation.


Listing 1.3: Slow code example

using Statistics

@time begin
    data = Float64[]
    for _ in 1:10^6
        group = Float64[]
        for _ in 1:5*10^2
            push!(group, rand())
        end
        push!(data, mean(group))
    end
    println("98% of the means lie in the estimated range: ",
            (quantile(data,0.01), quantile(data,0.99)))
end

98% of the means lie in the estimated range: (0.4699623580817418, 0.5299937027991253)
11.587458 seconds (10.00 M allocations: 8.034 GiB, 4.69% gc time)


The actual output of the code gives a range, in this case approximately 0.47 to 0.53 where 98% 
of the sample means (averages) lie. We cover more on this type of statistical analysis in the chapters 
that follow. 


The second line of output, generated by @time, states that it took about 11.6 seconds for the
code to execute. There is also further information indicating how many memory allocations took
place, in this case about 10 million, totaling just over 8 Gigabytes (in other words, Julia allocates a
little bit of memory, then clears it, and repeats this process many times over). This repeated
allocation and clearing of memory is what slows our processing time.


Now, look at Listing 1.4 and its output.

Listing 1.4: Fast code example

using Statistics

@time begin
    data = [mean(rand(5*10^2)) for _ in 1:10^6]
    println("98% of the means lie in the estimated range: ",
            (quantile(data,0.01), quantile(data,0.99)))
end

98% of the means lie in the estimated range: (0.469999864362845, 0.5300834606858865)
1.705009 seconds (1.01 M allocations: 3.897 GiB, 10.76% gc time)

As can be seen, the output gives essentially the same estimate for the interval containing 98% of the
means. However, in terms of performance, the output of @time indicates that this code is clearly superior.
It took about 1.7 seconds (compare with 11.6 seconds for Listing 1.3). In this case, the code is much
faster because far fewer memory allocations are made. Note that ‘gc time’ stands for “garbage
collection” and quantifies what percentage of the running time Julia was busy with internal memory
management.


Here are some comments for both code listings 1.3 and 1.4.

In both listings we use the Statistics package, required for the mean() function. Line 4
(Listing 1.3) creates an empty array of type Float64, data. Line 6 creates an empty array, group. Then
lines 7-9 loop 500 times, each time pushing to the array, group, a new random value generated from
rand(). The push!() function here uses the naming convention of having an exclamation mark
when the function modifies the argument. This is not part of the Julia language, but rather decorates
the name of the function. In this case, it modifies group by appending another new element. Here
is one point where the code is inefficient. The Julia compiler has no direct way of knowing how much
memory to allocate for group initially, hence some of the calls to push!() imply reallocation of the
array and copying. Line 10 is of a similar nature. The composition of push!() and mean() implies
that the new mean (average of 500 values) is pushed into data. However, some of these calls to
push!() imply a reallocation. At some point the allocated space of data will run out, and
at this point the system will need to internally allocate new memory, and copy all values to the new
location. This is a big cause of inefficiency in our example. Line 13 creates a tuple within println(),
using (,). The two elements of the tuple are return values from the quantile() function which
compute the 0.01 and 0.99 quantiles of data. Quantiles are covered further in Chapter 4.

The lines of Listing 1.4 are comparatively simple and in this case performance is better. All of the
computation is carried out in the comprehension in line 4, within the square brackets []. Writing the
code in this way allows the Julia compiler to pre-allocate $10^6$ memory spaces for data. Then,
applying rand() with an argument of 5*10^2, indicating the number of desired random values, allows
for faster operation. The functionality of rand() is covered in Section 1.5.
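A comprehension is not the only way to avoid repeated reallocation. As a hedged sketch, one may also allocate the output array once and then fill it in place. The function name sampleMeans is hypothetical, and we use fewer repetitions than in the listings so that the example runs quickly.

```julia
using Statistics

# Hypothetical helper: allocate the result array once and fill it in place
function sampleMeans(numMeans, groupSize)
    data = Vector{Float64}(undef, numMeans)   # memory for all means, allocated up front
    for i in 1:numMeans
        data[i] = mean(rand(groupSize))       # one temporary array per group
    end
    return data
end

@time data = sampleMeans(10^5, 500)
println("98% of the means lie in the estimated range: ",
        (quantile(data, 0.01), quantile(data, 0.99)))
```

Here no call to push!() is needed, since the size of data is known before the loop starts.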





Julia is inherently fast, even if you don't give it much thought as a programmer. However, in 
order to create truly optimized code, one needs to understand the inner workings of the system a 
bit better. There are some general guidelines that you may follow. A key is to think about memory 
usage and allocation as in the examples above. Other issues involve allowing Julia to carry out type 
inference efficiently. Nevertheless, for simplicity, the majority of the code examples of this book 
ignore types as much as possible and don't focus on performance. 


Types and Multiple Dispatch 


Functions in Julia are invoked via multiple dispatch. This means the way a function is executed,
i.e. its method, is based on the types of its inputs, i.e. its argument types. Indeed functions can have
multiple methods of execution, which can be checked using the methods() command.


Julia has a powerful type system which allows for user-defined types. One can check the type
of a variable using the typeof() function, while the functions subtypes() and supertype()
return the subtypes and the supertype of a particular type respectively. As an example, Bool is a subtype
of Integer, while Real is the supertype of Integer. This is illustrated in Figure 1.4, which shows
the type hierarchy of numbers in Julia.

Figure 1.4: Type hierarchy for Julia numbers.


One aspect of Julia is that if the user does not specify all variable types in a given piece of code, 
Julia will attempt to infer what types the unspecified variables should be, and will then attempt to 
execute the code using these types. This is known as type inference, and relies on a type inference 
algorithm. This makes Julia somewhat forgiving when it comes to those new to coding, and also 
allows one to quickly mock-up fast working code. It should be noted however that if one wants the 
fastest possible code, then it is good to specify the types involved. This also helps to prevent type 
instability during code execution. 
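The type queries and the inference of unspecified types mentioned above can be explored with a few lines. This is a sketch assuming a 64-bit system. Outside of the REPL, the subtypes() function needs the InteractiveUtils standard library to be loaded.

```julia
using InteractiveUtils        # provides subtypes() outside of the REPL

x = 3                         # no type annotation; Julia infers an integer type
println(typeof(x))            # Int64 on a 64-bit system
println(typeof(3.0))          # Float64

println(supertype(Integer))   # Real
println(subtypes(Integer))    # the types directly below Integer
println(Bool <: Integer)      # true: Bool is a subtype of Integer
println(Float64 <: Integer)   # false
```

The operator <: used in the last two lines tests whether one type is a subtype of another.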


Variable Scope, Local Variables, and Global Variables 


The scope of a variable is the region of code in which the variable is visible. Like almost any 
other programming language, Julia has rules about variable scope implying that not all variables can 
be accessed from everywhere within the program. Such restrictions serve several purposes including 
driving the programmer to create clear readable code, allowing the compiler to optimize execution, 
supporting concurrent operation, and reducing the possibility of name clashes. 


When discussing scope, a key distinction lies between local variables and global variables. The
former refers to variables defined within a function definition or a block of code such as a
for loop or while loop. The latter refers to variables that can potentially be accessed from
anywhere in the program. A detailed description of variable scope rules is in the Julia
documentation, see https://docs.julialang.org/en/v1/manual/variables-and-scoping/.
Here we only focus on a few key issues that are relevant to many of our code examples.





In general a global variable is a variable defined outside of a function or another block of code.
Since global variables have a much less restricted domain than local variables, it is typically good
programming practice to minimize or even eliminate their use. This is especially true for programs
that span more than a single file or many lines of code (hundreds or thousands). Some
Julia mechanisms that help organize variables include structures using the struct keyword
and modules using the module keyword. However, our code examples aim to be short self-contained
scripts and don't explicitly try to encapsulate data. We don't explicitly use structures and modules
and we define functions only if these are critically needed. Our aim is for simple code with
a minimal footprint. As a consequence our code examples often use global variables. If you
wish to integrate parts of our code examples in larger projects, then it is good practice to eliminate
or significantly reduce the use of globals as you carry out such integration. In less trivial coding
situations, you should aim to hardly use global variables and the global keyword. Still, for the
purposes of illustrative code snippets, our heavy use of globals is justified.


As a minimal introduction to variable scope we present Listing 1.5. One of the main goals of
this listing is to show the use of the global keyword which is prevalent in many of the code listings
that follow. You may wish to skip reading the details of this listing and then refer back to it when
considering variable scope. The important point is simply to observe that the global
keyword is sometimes needed and is thus present in quite a few of the examples that follow.


The key aspect of Listing 1.5 is the execution of a for loop in global scope followed by a similar
loop wrapped within a function. The global variables data, s, beta, and gamma are potentially
visible in all parts of the code, including in the for loop of lines 5-12 and the function sumData().
However, in certain cases the global keyword needs to be used to mark the variable as "coming
from" global scope. This is the case for the variable s that is marked as global in line 7. Interestingly,
when using a Jupyter notebook for such code (Jupyter notebooks are described in the next section),
such usage of the global keyword is not needed. The listing presents multiple other aspects of
variable scope. We detail some of these aspects in the code comments that follow. A full description
of variable scope rules is in the Julia documentation.


Listing 1.5: Variable scope and the global keyword

data = [1, 2, 3]
s = 0
beta, gamma = 2, 1

for i in 1:length(data)
    print(data[i], " ")
    global s    #This usage of the 'global' keyword is not needed in Jupyter
                #But elsewhere without it:
                #ERROR: LoadError: UndefVarError: s not defined
    s += beta*data[i]
    data[i] *= -1
end
# print(i)      #Would cause ERROR: LoadError: UndefVarError: i not defined
println("\nSum of data in external scope: ", s)

function sumData(beta)
    s = 0       #try adding the prefix global
    for i in 1:length(data)
        s += data[i] + gamma
    end
    return s
end
println("Sum of data in a function: ", sumData(beta/2))
@show s

1 2 3
Sum of data in external scope: 12
Sum of data in a function: -3
s = 12

In line 1 we define an array, data. It is a variable defined in global scope and is hence a global
variable. Similarly for the variables s, beta, and gamma in lines 2-3. Lines 5-12 loop over the range
1:length(data) where in each iteration the variable i takes the next value. The scope of the
variable i is within the block of the for loop (lines 5-12). Note that if you were to uncomment line 13,
the attempt to access i at that line would cause an error. The for loop defines a new local scope
and, because it is not inside a function, an assignment to s within it does not automatically refer to
the global s. This means that accessing the global variable s for modification requires
an explicit declaration with the global keyword as is done in line 7. For code in Jupyter notebooks
this can be avoided but otherwise not. Notice however that global declarations are not always
needed. For example the global variable beta is used in line 10, but as it isn't modified there is no
need to declare it as global with global. The variable (array) data also doesn't need to be declared
even though the contents of the array are modified in line 11. In lines 16-22 we define the function
sumData(). Here the name of the function argument is beta and is not the global variable beta.
Hence when the function is called in line 23, we can pass any argument to it for the local beta. In
this case the argument is half of the global beta, i.e. a value of 1. Note that we define a local variable
s in line 17. It is a different variable from the global s defined in line 2. If we were to add a global
keyword in line 17 then it would be the global variable s. You can try doing that and see how the
output of the @show macro in line 24, which displays the value of the global s, would change. Note
again that the global variables data and gamma are used inside the body of the function for
read-only purposes.
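When integrating such code into a larger project, one common way of avoiding the global keyword is to pass values as function arguments. The following is only a sketch of this idea, and the function name scaledSum is hypothetical.

```julia
# Hypothetical function: all variables inside are local or passed as arguments
function scaledSum(data, beta)
    s = 0                  # local to the function
    for x in data
        s += beta*x        # no global keyword needed: s belongs to the enclosing function
    end
    return s
end

data = [1, 2, 3]
println("Sum computed inside a function: ", scaledSum(data, 2))
```

Since the loop now sits inside a function, the variable s it updates is the local s of that function, and the code behaves the same way in the REPL, in a script, and in a Jupyter notebook.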





1.2 Setup and Interface 


There are multiple ways to run Julia including the REPL command line interface, Jupyter 
notebooks, the Juno IDE (Integrated Development Environment) on the Atom editor, as well as 
several other working environments. Here we focus on the REPL and Jupyter notebooks as these 
are the most mature environments to date. We also mention that in developing the code examples, 
we used both Jupyter notebooks and the Juno IDE. The latter is also available directly from Julia 
Computing and is packaged as Julia Pro. 


No matter how you run Julia, there is an instance of a Julia kernel running. The running 
kernel contains all of the currently compiled Julia functions, loaded packages, defined variables, and 
objects. You may even run multiple kernels, sometimes in a distributed manner. We first describe 
the REPL and Jupyter notebooks environments. We then describe the package manager which allows 
one to extend Julia’s basic functionality by installing additional packages. 


REPL Command Line Interface 


The Read Evaluate Print Loop (REPL) command line interface is a simple and straightforward
way of using Julia. It can be downloaded directly from: https://julialang.org/downloads/.
Downloading it implies downloading the Julia kernel as well.


Once installed locally, Julia can be launched and the Julia REPL will appear, within which Julia
commands can be entered and executed. For example, in Figure 1.5 the code 1+2 was entered,
followed by the enter key. Note that if Julia is launched as its own stand alone application, a new
Julia instance will appear. However, if you are working in a shell/command-line environment, the
REPL can also be launched from within the current environment.

Documentation: https://docs.julialang.org
Type "?" for help, "]?" for Pkg help.
Version 1.3.0 (2019-11-26)
Official https://julialang.org/ release

Figure 1.5: Julia’s REPL interface.


When using the REPL, typically one will also work with Julia files which have the .jl extension.
In fact, every code listing in this book is stored in such a file. These files are available in the book’s
GitHub repository.


Jupyter Notebooks 


An alternative to using the REPL is to use a Jupyter Notebook as presented in Figure 1.6. It is
a browser based interface in which one can type and execute Julia code, as well as Python, R, and 
other languages. Jupyter notebooks are easy to use and allow one to combine code, output, visuals, 
and markdown formatting all together in one document. A Jupyter notebook is both a means of 
presentation and execution. 


Each notebook consists of a series of cells, in which code can be typed and run. Cells can be
of different types. Code cells allow Julia code to be entered and executed, while markdown cells
allow for formatting of the document in Markdown, which is a simple formatting language that also
incorporates hyperlinks, images, and LaTeX formatting for formulas.


Jupyter notebook files have the .ipynb extension. The content of notebooks can also be
exported as PDF and other formats. A common way to run Jupyter for Julia is using the Anaconda
Python distribution which installs a Jupyter notebook server locally. A technical note is that the
IJulia (Julia) package is required for Julia to work within Jupyter notebooks. More on packages
below. Another advantage of Jupyter notebooks is that because they are browser based, they can
be configured to run over a remote connection.


The user interface for using Jupyter notebooks is easy to learn. When starting, note that there
are two input modes. Edit mode allows code/text to be entered into a cell, while command mode
allows keyboard-activated actions, such as toggling line numbering, copying cells, and deleting cells.
Cells can be executed by first selecting the cell and then pressing ctrl-enter or shift-enter.
In command mode, additional cells can be created by pressing a or b to create cells above or below
respectively.

Figure 1.6: An example of a Jupyter notebook accessed via JuliaBox.


The Package Manager 


Although Julia comes with many built-in features, the core system can be extended. This is
done by installing packages, which can be added to Julia at your discretion. This allows users
to customize their Julia installation depending on their needs, and at the same time offers
support for developers who wish to create their own packages, enriching the Julia ecosystem. Note
that packages may be either registered, meaning that they are part of the Julia package
repository, or unregistered, meaning they are not. A list of currently registered packages is available at:
https://julialang.org/packages/


When using the REPL you can enter the package manager mode by typing “]”. This mode
can be exited by hitting the backspace key. In this mode, packages can be installed, updated, or
removed. The following lists a few of the many useful commands available:

add Foo adds the package Foo.jl to the current Julia build.
status lists what packages and versions are currently installed.
update updates existing packages.
remove Foo removes package Foo.jl from the current Julia build.





An alternative which works both in the REPL and in Jupyter notebooks is to use functions from
the Pkg package. The standard usage is of the form using Pkg followed by Pkg.add("Foo").
This adds the package Foo.jl. Similar functions exist via the Pkg package for other package
operations.
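As an illustration of this functional usage, here is a minimal sketch that installs a package only when it is missing. The helper name ensureInstalled is hypothetical. It relies on Base.find_package, which is not exported from Base, so treat its use here as an assumption that may change between Julia versions.

```julia
using Pkg

# Hypothetical helper: install a package only if it cannot be found already
function ensureInstalled(name)
    if Base.find_package(name) === nothing
        Pkg.add(name)      # one-time installation; needs network access
    end
    return nothing
end

ensureInstalled("Statistics")   # ships with Julia, so normally nothing is installed here
using Statistics                # loading is still needed in every session
println(mean([1, 2, 3]))
```

Replacing "Statistics" with the name of a registered package that is not yet installed triggers a call to Pkg.add().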


As you study the code examples in this book, you will notice that most start with the using
command, followed by a package name. This is how Julia packages are loaded into the current
namespace of the kernel, so that the package's functions, objects, and types can be used. Note
that writing using does not imply installing a package. Installation of a package is a one-time
operation which must be performed before the package can be used. In comparison, typing the
keyword using is required in every session in which functionality of a package is needed.

Packages Used in This Book


The code in this book uses a variety of Julia packages. Some of the key packages used in 
the context of probability, statistics and machine learning are DataFrames, Distributions, 
Flux, GLM, Plots, Random, Statistics, StatsBase, and StatsPlots as well as many other 
important packages. Some of these are built-in with the base installation, for example Statistics 
and Random, while others require user installation via the package manager as described above. 
A short description of each of the packages that we use in the book is contained below. We have 
placed a ‘*’ next to every package that is part of the basic installation. 


*Base.jl is the basic Julia package sitting at the base of the language.

BSON.jl allows storing and reading data using the common Binary JSON format.

Calculus.jl provides tools for working with basic calculus operations including differentiation
and integration both numerically and symbolically.

CategoricalArrays.jl provides tools for working with categorical variables.

Clustering.jl provides support for various clustering algorithms.

Combinatorics.jl allows enumerating combinations and permutations.

CSV.jl is a utility library for working with CSV and other delimited files in Julia.

DataFrames.jl is a package for working with tabular data.

DataStructures.jl provides support for various types of data structures.

*Dates.jl provides support for working with dates and times.

DecisionTree.jl is a package for decision trees and random forest algorithms.

DifferentialEquations.jl is a suite which provides efficient Julia implementations of
numerical solvers for various types of differential equations.

Distributions.jl provides support for working with probability distributions.

Flux.jl is a machine learning library written in pure Julia.

GLM.jl is a package on linear models and generalized linear models.

HCubature.jl is an implementation of multidimensional “h-adaptive” (numerical) integration.

HypothesisTests.jl implements a wide range of hypothesis tests and confidence intervals.

HTTP.jl provides HTTP client and server functionality.

IJulia.jl is required to interface Julia with Jupyter notebooks.

Images.jl is an image processing library.

JSON.jl is a package for parsing and printing JSON.

Juno.jl is a package needed for using the Juno development environment.

KernelDensity.jl is a kernel density estimation package.

Lasso.jl implements LASSO model fitting.

LaTeXStrings.jl makes it easier to type LaTeX equations in string literals.

LIBSVM.jl is a package for Support Vector Machines (SVM) using the LIBSVM library.

LightGraphs.jl provides support for the implementation of graphs in Julia.

*LinearAlgebra.jl provides linear algebra support.

Measures.jl allows building up and representing expressions involving differing types of units
that are then evaluated, resolving them into absolute units.

MLDatasets.jl provides an interface for accessing common Machine Learning (ML) datasets.

MultivariateStats.jl is a package for multivariate statistics and data analysis, including
ridge regression, PCA, dimensionality reduction and more.

NLsolve.jl provides methods to solve non-linear systems of equations.

Plots.jl is one of the main plotting packages in the Julia ecosystem. It is the main plotting
package used throughout our book.

PyCall.jl provides the ability to directly call and fully interoperate with Python from Julia.

PyPlot.jl provides a Julia interface to the Matplotlib plotting library from Python, and
specifically to the matplotlib.pyplot module.

QuadGK.jl provides support for one-dimensional numerical integration using adaptive
Gauss-Kronrod quadrature.

*Random.jl provides support for pseudo random number generation.

RCall.jl provides several different ways of interfacing with R from Julia.

RDatasets.jl provides an easy way to interface with the standard datasets that are available in
the core of the R language, as well as several datasets included in R’s more popular packages.

Roots.jl contains routines for finding roots of continuous scalar functions of a single variable.

SpecialFunctions.jl contains various special mathematical functions, such as Bessel, zeta,
digamma, along with sine and cosine integrals, as well as others.

*Statistics.jl contains common statistics functions such as mean and standard deviation.

StatsBase.jl provides basic support for statistics including high-order moment computation,
counting, ranking, covariances, sampling and cumulative distribution function estimation.

StatsModels.jl allows specifying models using formulas, as is common in linear models.

StatsPlots.jl provides extensive statistical plotting recipes.

TimeSeries.jl provides support for working with time series data.


Many additional useful packages, not employed in our code examples, are listed in Appendix C.


1.3 Crash Course by Example


Almost every procedural programming language needs functions, conditional statements, loops 
and arrays. Similarly, every scientific programming language needs to support plotting, matrix 
manipulations, and floating point calculations. Julia is no different. In this section we present 
several examples, and through them begin to explore various basic programming elements. Each 
example aims to introduce another aspect of Julia. These examples are not necessarily minimal 
examples needed for learning the basics of Julia, nor do they build statistical foundations from the 
ground up. Rather, they are designed to show what can be done with Julia. Hence if you find these 
examples too complex from either a programming or a mathematical perspective, feel free to skip 
directly to Chapter 2, where basic probability is demonstrated via simple examples from the ground
up.


Alternatively, if you prefer to engage with the language through simpler examples, you may
wish to use other resources alongside this book. If you are a beginner to programming, we
recommend the introductory book to programming with the Julia language, “Think Julia: How to Think
Like a Computer Scientist” by A. Downey, B. Lauwens [DL19]. If you are a seasoned programmer and
are looking for a more general purpose text about Julia, see “Julia 1.0 Programming Cookbook” by
B. Kaminski, P. Szufel [KS18]. Another option is to visit https://julialang.org/learning/
for a variety of other resources.


In addition to the general Julia programming resources mentioned above, there are also several
other texts that are worth considering for specific aspects of scientific computing, data science
and artificial intelligence. One book provides an exhaustive introduction to optimization
algorithms together with Julia code. Another focuses on operations research using Julia.
Finally, a third book is an applied data science resource, as is [V16].


We now present some select examples which are designed to illustrate basic programming (bubble 
sort), show simple numerical computation (roots of a polynomial), provide examples of how to work 
with matrices and randomness (Markov chain), and show how one can interface with the web and 
do basic text processing. 


Bubble Sort 


In our first example, we construct a basic sorting algorithm using first principles. The algorithm
we consider here is called Bubble Sort. This algorithm takes an input array, indexed 1,...,n, then
sorts the elements smallest to largest by allowing the larger elements, or “bubbles”, to “percolate
up”. The algorithm is implemented in Listing 1.6. As can be seen from the code, the elements at
locations j and j+1 are swapped inside the two nested loops whenever they are out of order. This
establishes an increasing (non-decreasing) order in the array. The conditional statement if is used to
check if the numbers at indexes j and j+1 are in increasing order, and if needed, swap them.

Listing 1.6: Bubble sort

function bubbleSort!(a)
    n = length(a)
    for i in 1:n-1
        for j in 1:n-i
            if a[j] > a[j+1]
                a[j], a[j+1] = a[j+1], a[j]
            end
        end
    end
    return a
end

data = [65, 51, 32, 12, 23, 84, 68, 1]
bubbleSort!(data)

8-element Array{Int64,1}:
  1
 12
 23
 32
 51
 65
 68
 84

In lines 1-11, we define a function, named bubbleSort!(). The input argument a is implicitly
expected to be an array. The function sorts a in place, and returns a reference to the array. Note
that in this case, the function name ends with "!" by convention. This exclamation mark decorates
the name of the function, letting us know that the function argument, a, will be modified (a is sorted
in place without memory copying). In Julia, arrays are passed by reference. Arrays are indexed from
1 to the length of the array, obtained by length(). In line 6 the elements a[j] and a[j+1] are
swapped by using assignment of the form m,n = x,y. Here the right-hand side is evaluated first and
then m is set to x and n is set to y, so the swap needs no temporary variable. In line 14, the function
is called on data. As it is the last line of the code block and is not
followed by a ‘;’, the expression evaluated in that line is presented as output, in our case the sorted
array. Note that it has a type Array{Int64,1}, meaning a one-dimensional array of integers. Julia
inferred this type automatically. Try changing some of the values in line 13 to floating points,
e.g. [65.0, 51.0, ...], and see how the output changes.
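The swap and the in-place modification of an array argument can be tried in isolation. This is a small sketch, and the function name negateFirst! is hypothetical.

```julia
a = [65, 51]
a[1], a[2] = a[2], a[1]    # right-hand side is evaluated first, then assigned
println(a)

# Arrays are passed by reference, so a function can modify its argument
function negateFirst!(v)   # hypothetical example function
    v[1] = -v[1]
    return v
end

negateFirst!(a)
println(a)
```

After the call to negateFirst!() the change is visible in a itself, since no copy of the array was made.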





Keep in mind that Julia already contains standard sorting functions such as sort() and
sort!(), so you don’t need to implement your own sorting function as we did. For more
information on these functions use ?sort. Also, the bubble sort algorithm is not the most efficient
sorting algorithm, but is introduced here as a means of understanding Julia better. For an input
array of length n, it will execute line 5 about $n^2/2$ times. For non-small n, this is much slower
performance than optimal sorting algorithms where the number of comparisons can be reduced to
an order of $n \log(n)$.
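The count of comparisons can be checked directly by mirroring the two nested loops of the bubble sort. The following sketch uses a hypothetical function, countComparisons, and also shows the built-in sort() for comparison.

```julia
# Count how many times the comparison in the inner loop of bubble sort runs
function countComparisons(n)
    numComparisons = 0
    for i in 1:n-1
        for j in 1:n-i
            numComparisons += 1
        end
    end
    return numComparisons
end

for n in [8, 100, 1000]
    println("n = ", n, ": ", countComparisons(n), " comparisons, n^2/2 = ", n^2/2)
end

# The built-in sort() returns a sorted copy and leaves the input untouched
data = [65, 51, 32, 12, 23, 84, 68, 1]
println(sort(data))
```

The exact count is n(n-1)/2, which approaches $n^2/2$ as n grows.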
Roots of a Polynomial


We now consider a different type of programming example that comes from elementary numerical 
analysis. Consider the polynomial, 


$$f(x) = a_n x^n + a_{n-1} x^{n-1} + \ldots + a_1 x + a_0,$$


with real-valued coefficients $a_0,\ldots,a_n$. Say we wish to find all x values that solve the equation
f(x) = 0. We can do this numerically with Julia using the find_zeros() function from the
Roots package. This general purpose solver takes a function as input and numerically tries to find
all its roots within some domain. As an example, consider the quadratic polynomial,


$$f(x) = -10x^2 + 3x + 1.$$


Ideally, we would like to supply the find_zeros() function with the coefficient values, -10, 3
and 1. However, find_zeros() is not designed for a specific polynomial, but rather for any Julia
function that represents a real mathematical function. Hence one way to handle this is to define a
Julia function specifically for this quadratic f(x) and give it as an argument to find_zeros().
However, here we will take this one step further, and create a slightly more general solution. We
first create a function called polynomialGenerator which takes a list of arguments representing
the coefficients, a_0, a_1,...,a_n (constant term first), and returns the corresponding polynomial
function. We then use this function as an argument to the find_zeros() function, which then
returns the roots of the original polynomial.


Listing 1.7 shows our approach. Note that for our example it is straightforward to solve for the
roots analytically and verify the code. This is done using the quadratic formula as follows,


x = \frac{-3 \pm \sqrt{3^2 - 4(-10)}}{2(-10)} = \frac{3 \pm 7}{20} \quad \Rightarrow \quad x = 0.5, \; -0.2.











Listing 1.7: Roots of a polynomial


using Roots

function polynomialGenerator(a...)
    n = length(a)-1
    poly = function(x)
               return sum([a[i+1]*x^i for i in 0:n])
           end
    return poly
end

polynomial = polynomialGenerator(1,3,-10)
zeroVals = find_zeros(polynomial,-10,10)
println("Zeros of the function f(x): ", zeroVals)





Zeros of the function f(x): [-0.2, 0.5] 
In line 1 we employ the using keyword, indicating to include elements from the package Roots. Note
that this assumes that the package has already been added as part of the Julia configuration. Lines 3-9
define the function polynomialGenerator(). An argument, a, along with the splat operator ...
indicates that the function will accept a comma separated list of parameters of unspecified length.
For our example we have three coefficients, specified in line 11. Line 4 makes use of the length()
function to determine how many arguments were given to the function polynomialGenerator().
Notice that the degree of the polynomial, represented in the local variable n, is one less than the
number of arguments. Lines 5-7 define an internal function with an input argument x, and then
store this function as the variable poly, returned from polynomialGenerator(). In Julia, one can
pass functions as arguments, and assign them to variables. The main workhorse of this function is
line 6, where the sum() function is used to sum over an array of values. This array is implicitly
defined using a comprehension. In this case, the comprehension is [a[i+1]*x^i for i in 0:n]. This
creates an array of length n + 1 with one element, a[i+1]*x^i, for each i = 0,...,n. In line 12 the
find_zeros() function from the Roots package is used to find the roots of the polynomial. The
latter two arguments, -10 and 10, are the endpoints of the interval in which roots are searched for.
The roots calculated are then assigned to zeroVals and the output printed.
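As a cross-check that does not require the Roots package, the closure can be evaluated at the roots given by the quadratic formula. This is a minimal sketch; the helper quadraticRoots() is a hypothetical name introduced only for illustration, and the generator is written with an anonymous function rather than the form used in the listing.

```julia
# Sketch: build the polynomial closure and compare with the quadratic formula.
function polynomialGenerator(a...)
    n = length(a) - 1
    return x -> sum(a[i+1]*x^i for i in 0:n)    # a[1] is the constant term
end

polynomial = polynomialGenerator(1, 3, -10)      # f(x) = 1 + 3x - 10x^2

# Quadratic formula for a2*x^2 + a1*x + a0 (hypothetical helper)
function quadraticRoots(a0, a1, a2)
    disc = sqrt(a1^2 - 4*a2*a0)
    return sort([(-a1 + disc)/(2*a2), (-a1 - disc)/(2*a2)])
end

analyticRoots = quadraticRoots(1, 3, -10)
println("Analytic roots: ", analyticRoots)
println("f at the roots: ", polynomial.(analyticRoots))
```

The values of f at the analytic roots should be zero up to floating point rounding, which also confirms that the first argument of polynomialGenerator() plays the role of the constant term.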





Steady State of a Markov Chain 


We now introduce some basic linear algebra computations and simulation through a simple
Markov chain example. Consider a theoretical city, where the weather is described by three possible
states: (1) ‘Fine’, (2) ‘Cloudy’ and (3) ‘Rain’. On each day, given a certain state, there is a
probability distribution for the weather state of the next day. This simplistic weather model constitutes
a discrete time (homogeneous) Markov chain. This Markov chain can be described by the 3 x 3
transition probability matrix, P, where the entry P_{i,j} indicates the probability of transitioning to
state j given that the current state is i. The transition probabilities are illustrated in Figure 1.7.


One important computable quantity for such a model is the long term proportion of occupancy
in each state. That is, in steady state, what proportion of the time is the weather in state 1, 2 or 3.
Obtaining this stationary distribution, denoted by the vector π = [π_1, π_2, π_3] (or an approximation
for it) can be achieved in several ways, as shown in Listing 1.8. For pedagogical and exploratory
reasons we use four methods to find the stationary distribution. Note that some of these methods
involve linear algebra and/or results from the theory of Markov chains. These are not covered here,
but rather discussed in Section 10.2 of Chapter 10. If you haven’t been exposed to linear algebra,
we suggest you only skim through this example. The four methods that we use are:


Figure 1.7: Three-state Markov chain of the weather.
Notice the sum of the arrows leaving each state is 1.


1. By raising the matrix P to a high power, (repeated matrix multiplication of P with itself),
the limiting distribution is obtained in any row. Mathematically,


\pi_i = \lim_{n \to \infty} [P^n]_{j,i} for any index, j. (1.1)


2. We solve the (overdetermined) linear system of equations,


\pi P = \pi and \sum_{i=1}^{3} \pi_i = 1. (1.2)


This linear system of equations can be reorganized into a system with 3 equations and 3
unknowns by realizing that one of the equations inside πP = π is redundant. Written out
explicitly we have:


\begin{bmatrix} P_{11}-1 & P_{21} & P_{31} \\ P_{12} & P_{22}-1 & P_{32} \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \pi_1 \\ \pi_2 \\ \pi_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}. (1.3)


3. By making use of the Perron Frobenius Theorem which implies the eigenvector corresponding
to the eigenvalue of maximal magnitude is proportional to π, we find this eigenvector and
normalize it by the sum of probabilities (L_1 norm).


4. We run a simple Monte Carlo simulation (see also Section 1.5) by generating random values
of the weather according to P, and take the long term proportions of each state. In contrast
to the previous three approaches, this approach does not require any linear algebra.


The output shows that the four estimates of the vector π are very similar. Each column represents
the stationary distribution obtained from methods 1 to 4, while the rows represent the stationary
probability of being in each state.
Listing 1.8: Steady state of a Markov chain in several ways


using LinearAlgebra, StatsBase

# Transition probability matrix
P = [0.5 0.4 0.1;
     0.3 0.2 0.5;
     0.5 0.3 0.2]

# First way
piProb1 = (P^100)[1,:]

# Second way
A = vcat((P' - I)[1:2,:],ones(3)')
b = [0 0 1]'
piProb2 = A\b

# Third way
eigVecs = eigvecs(copy(P'))
highestVec = eigVecs[:,findmax(abs.(eigvals(P)))[2]]
piProb3 = Array{Float64}(highestVec)/norm(highestVec,1)

# Fourth way
numInState = zeros(Int,3)
state = 1
N = 10^6
for t in 1:N
    numInState[state] += 1
    global state = sample(1:3,weights(P[state,:]))
end
piProb4 = numInState/N

display([piProb1 piProb2 piProb3 piProb4])





3x4 Array{Float64,2}:
 0.4375  0.4375  0.4375  0.437521
 0.3125  0.3125  0.3125  0.312079
 0.25    0.25    0.25    0.2504
In lines 4-6 the transition probability matrix P is defined. The notation for explicitly defining a
matrix in Julia is the same as that of Matlab. In line 9, (1.1) is implemented and n is taken as
100 (approximating ∞). The first row of the resulting matrix is obtained via [1,:]. Note that
using [2,:] or [3,:] instead will approximately yield the same result, since the limit in equation
(1.1) is independent of j. Lines 12-14 use several matrix operations to set up the system of
equations (1.3). The use of vcat() (vertical concatenation) creates the matrix on the left hand side
by concatenating the 2x3 matrix, (P' - I)[1:2,:] with a row vector of 1's, ones(3)'. Note the
use of I which is the identity matrix. Finally, the solution is found by using A\b in the same fashion
as Matlab for solving linear equations of the form Ax = b. In lines 17-19 the built-in eigvecs() and
eigvals() functions from LinearAlgebra are used to find a set of eigenvectors of the transpose P'
(that is, left eigenvectors of P) and the eigenvalues of P respectively. The findmax() function is
then used to find the index matching the eigenvalue
with the largest magnitude. Note that the absolute value function abs() works on complex values
as well. Also note that when normalizing in line 19, we use the L_1 norm which is the sum
of absolute values of the vector. In lines 22-29 a direct Monte Carlo simulation of the Markov chain
is carried out through a million iterations and modifications of the state variable. We accumulate
the occurrences of each state in line 26. Line 27 is the actual transition, which uses the sample()
function from the StatsBase package. At each iteration the next state is randomly chosen based on
the probability distribution given the current state. This is done by passing the relevant row of P
as a weight vector, created via weights(). Note
that the normalization from counts to frequency in line 29 uses the fact that Julia casts integer counts
to floating point numbers upon division. That is, the variables numInState and N are an array
of integers and an integer respectively, but the division (vector by scalar) makes piProb4 a floating
point array.
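Whatever method is used, a candidate stationary vector can be verified directly against the defining equations. The following is a minimal sketch that uses only the LinearAlgebra standard library; the matrix values and the name piVec are assumptions made for illustration.

```julia
using LinearAlgebra

# Illustrative transition matrix (assumed values); each row must sum to 1.
P = [0.5 0.4 0.1;
     0.3 0.2 0.5;
     0.5 0.3 0.2]

# Solve the 3x3 system: two rows of (P' - I) plus the normalization row.
A = vcat((P' - I)[1:2, :], ones(3)')
b = [0.0, 0.0, 1.0]
piVec = A \ b

println("Row sums of P:  ", sum(P, dims=2)[:])
println("Stationary pi:  ", piVec)
println("Residual norm:  ", norm(piVec' * P - piVec'))
```

A residual close to zero confirms that the computed vector satisfies πP = π, including the equation that was dropped when forming the 3 x 3 system.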








Web Interfacing, JSON and String Processing 


We now look at a different type of example which deals with text. Imagine that we wish to 
analyze the writings of Shakespeare. In particular, we wish to look at the occurrences of some 
common words in all of his known texts and present a count of a few of the most common words. 
One simple and crude way to do this is to pre-specify a list of words to count, and then specify how 
many of these words we wish to present. 


To add another dimension to this problem we will use a JSON (JavaScript Object Notation)
file. This file format is widely used for storing hierarchical datasets both in data science and web
development, hence the name. We use the below JSON file in the example that follows.


{
    "words": ["heaven", "hell", "man", "woman", "boy", "girl", "king", "queen",
              "prince", "sir", "love", "hate", "knife", "english", "england", "god"],
    "numToShow": 5
}


The JSON format uses ‘{ }’ characters to enclose a hierarchical nested structure of key value
pairs. In the example above there isn't any nesting, but rather only one top level set of ‘{ }’.
Within this there are two keys: words and numToShow. Treating this as a JSON object means
that the key numToShow has an associated value 5. Similarly, words is an array of strings, with
each element a potentially interesting word to consider in Shakespeare's texts. In general, JSON
files are used for much more complex descriptions of data, but here we use this simple structure for
illustration.
Now with some basic understanding of JSON, we can proceed with our example. The code in
Listing 1.9 retrieves Shakespeare’s texts from the web and then counts the occurrences of each of
the words, ignoring case. We then show a count for each of the numToShow most common words.


Listing 1.9: Web interface, JSON and string parsing


using HTTP, JSON

data = HTTP.request("GET",
    "https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt")
shakespeare = String(data.body)
shakespeareWords = split(shakespeare)

jsonWords = HTTP.request("GET",
    "https://raw.githubusercontent.com/"*
    "h-Klok/StatsWithJuliaBook/master/1_chapter/jsonCode.json")
parsedJsonDict = JSON.parse( String(jsonWords.body) )

keywords = Array{String}(parsedJsonDict["words"])
numberToShow = parsedJsonDict["numToShow"]
wordCount = Dict([(x,count(w -> lowercase(w) == lowercase(x), shakespeareWords))
    for x in keywords])

sortedWordCount = sort(collect(wordCount),by=last,rev=true)
sortedWordCount[1:numberToShow]





5-element Array{Pair{String,Int64},1}:
 "king"=>1698
 "love"=>1279
 "man"=>1033
 "sir"=>721
 "god"=>555





In lines 3-4 HTTP.request from the HTTP package is used to make an HTTP request. In line 5 the
body of data is then converted to a text string via the String() constructor function. In line 6 this
string is then split into an array of individual words via the split() function. In lines 8-11 the JSON
file is first retrieved. Then this string is parsed into a JSON object. The URL string for the JSON
file doesn't fit on one line, so we use * to concatenate strings. In line 11 the parse() function from
the JSON package is used to parse the body of the file and create a dictionary. Line 13 shows the
strength of using JSON, as the value associated with the JSON key words is accessed. This value (i.e.
array of words) is then cast to an Array{String} type. Similarly, the value associated with the key
numToShow is accessed in line 14. In line 15 a Julia dictionary is created via Dict(). It is created
from a comprehension of tuples, each with x (being a word) in the first element, and the count of
this word in shakespeareWords as the second element. In using count we define an anonymous
function as the first argument that compares an input test argument w to the given word x, with both
converted to lowercase. Finally line 18 sorts the dictionary entries by their values, in descending
order, and line 19 displays as output the numberToShow most frequent words.
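The approach above scans the full text once per keyword. An alternative is a single pass over the words that increments a dictionary entry whenever a keyword is met. The following is a minimal sketch on a made-up sentence; it needs no packages, and the sample text and keyword list are assumptions for illustration only.

```julia
# Sketch: count pre-specified words in one pass (sample text is made up).
text = "The King loves the Queen and the queen loves the king but not the prince"
keywords = ["king", "queen", "prince", "god"]
numberToShow = 2

wordCount = Dict(k => 0 for k in keywords)
for w in split(text)
    lw = lowercase(w)                 # ignore case, as in the listing
    if haskey(wordCount, lw)
        wordCount[lw] += 1
    end
end

sortedWordCount = sort(collect(wordCount), by=last, rev=true)
println(sortedWordCount[1:numberToShow])
```

For a long text and many keywords this does one dictionary lookup per word rather than one full scan per keyword. Note that neither version strips punctuation, so a token such as "king," is not counted as "king".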
Figure 1.8: An introductory Plots example.


1.4 Plots, Images and Graphics 





There are many different plotting packages available in Julia, including PyPlot, Gadfly, 
Makie as well as several others. Arguably, as a starting point, two of the most useful plotting 
packages are the Plots package and the StatsPlots package. Plots simplifies the process of 
creating plots, as it brings together many different plotting packages under a single API. With 
Plots, you can learn a single syntax, and then use the backend of your choice to create plots. 
Almost all of the examples throughout this book use the Plots package, and in almost all of the 
examples the code presented directly generates the figures. That is, if you want examples of how 
to create certain plots, one way of doing this is to browse through the figures of the book until you 
find one of interest, and then look at the associated code block and use this as inspiration for your 
plotting needs. 


In Plots, input data is passed positionally, while aspects of the plot can be customized by
specifying keywords for specific plot attributes, such as line color or width. In general, each attribute
can take on a range of values, and in addition, many attributes have aliases which allow one to
write short, concise code. For example color=:blue can be shortened to c=:blue, and we make
use of this alias mechanism throughout the book's examples.


Since the code listings from this book can be used as direct examples, we don’t present an exten- 
sive tutorial on the finer aspects of creating plots. Rather, if you are seeking detailed instructions 
or further references on finer points, we recommend that you visit: 


http://docs.juliaplots.org 


As a minimal overview, the following is a brief list of some of the more commonly used Plots 
package functions for generating plots: 


plot() - Can be used to plot data in various ways, including series data, single functions, multiple
functions, as well as for presenting and merging other plots. This is the most common plotting
function.


scatter() - Used for plotting scattered data points not connected by a line.


bar() - Used for plotting bar graphs.


heatmap() - Used to plot a matrix, or an image.


surface() - Used to plot surfaces (3D plots). This is the typical way in which one would plot a
real valued function of two variables.


contour() - Used to create a contour plot. This is an alternative way to plot a real valued
function of two variables.


contourf() - Similar to contour(), but with shading between contour lines.


histogram() - Used to plot histograms of data.


stephist() - A stepped histogram. This is a histogram with no filling.


In addition, each of these functions also has a companion function with a ‘!’ suffix, e.g.
plot!(). These functions modify the previous plot, adding additional plotting aspects to it.
This is shown in many examples throughout the book. Furthermore, the Plots package supplies
additional important functions such as savefig() for saving a plot, annotate!() for adding
annotations to plots, default() for setting plotting default arguments, and many more. Note that
in the examples throughout this book pyplot() is called. This activates the PyPlot backend for
plotting.








As a basic introductory example focused solely on plotting, we present Listing 1.10. In this
listing, the main object is the real valued function of two variables, f(x,y) = x^2 + y^2. We use this
quadratic form as a basic example, and also consider the cases of y = 0 and y = 2. The code generates
Figure 1.8. Note the use of the LaTeXStrings package enabling LaTeX formatted formulas. See
for example, http://tug.ctan.org/info/undergradmath/undergradmath.pdf











Listing 1.10: Basic plotting


using Plots, LaTeXStrings, Measures; pyplot()

f(x,y) = x^2 + y^2
f0(x) = f(x,0)
f2(x) = f(x,2)

xVals, yVals = -5:0.1:5 , -5:0.1:5
plot(xVals, [f0.(xVals), f2.(xVals)],
    c=[:blue :red], xlims=(-5,5), legend=:top,
    ylims=(-5,25), ylabel=L"f(x,\cdot)", label=[L"f(x,0)" L"f(x,2)"])
p1 = annotate!(0, -0.2, text("(0,0) The minimum\n of f(x,0)", :left, :top, 10))

z = [ f(x,y) for y in yVals, x in xVals ]
p2 = surface(xVals, yVals, z, c=cgrad([:blue, :red]),legend=:none,
    ylabel="y", zlabel=L"f(x,y)")

M = z[1:10,1:10]
p3 = heatmap(M, c=cgrad([:blue, :red]), yflip=true, ylabel="y",
    xticks=([1:10;], xVals[1:10]), yticks=([1:10;], yVals[1:10]))

plot(p1, p2, p3, layout=(1,3), size=(1200,400), xlabel="x", margin=5mm)
Line 1 includes the following packages: Plots for plotting; LaTeXStrings for displaying labels
using LaTeX formatting as in line 10; and Measures for specifying margins such as in line 21. In
line 1, as part of a second statement following ‘;’, pyplot() is called to indicate that the PyPlot
plotting backend is activated. In line 3 we define the two variable real valued function f() which
is the main object of this example. We then define two related single variable functions, f0() and
f2(), i.e. f(x,0) and f(x,2). In line 7 we define the ranges xVals and yVals. Line 8 is the first call
to plot() where xVals is the first argument indicating the horizontal coordinates, and the array
[f0.(xVals), f2.(xVals)] represents two data series to be plotted. Then in the same function
call on lines 9 and 10, we specify colors, x-limits, y-limits, location of the legend, and the labels,
where the L prefix denotes a LaTeX string. In line 11 annotate!() modifies the current plot with an
annotation. The return value is the plot object stored in p1. Then in lines 13-15 we create a surface
plot. The ‘height’ values are calculated via a two way comprehension and stored in the matrix z on
line 13. Then surface() is used in lines 14-15 to create the plot, which is then stored in the variable
p2. Note the use of the cgrad() function to create a color gradient. In lines 17-19 a matrix of values
is plotted via heatmap(). The argument yflip=true is important for orienting the matrix in the
standard manner. Finally, in line 21 the three previous subplots are plotted together as a single
figure via the plot() function.





Histogram of Hailstone Sequence Lengths 


In this example we use Plots to create a histogram in the context of a well-known mathematical
problem. Consider that we generate a sequence of numbers as follows: given a positive integer x,
if it is even, then the next number in the sequence is x/2, otherwise it is 3x + 1. That is, we start
with some x_0 and then iterate x_{n+1} = f(x_n) with


f(x) = \begin{cases} x/2 & \text{if } x \bmod 2 = 0, \\ 3x+1 & \text{if } x \bmod 2 = 1. \end{cases}


The sequence of numbers arising from this function is called the hailstone sequence. As an
example, if x_0 = 3, the resulting sequence is,


3, 10, 5, 16, 8, 4, 2, 1, ...,


where the cycle 4, 2, 1 continues forever. We call the number of steps (possibly infinite) needed to
hit 1 the length of the sequence, in this case 7. Note that different values of x_0 will result in different
hailstone sequences of different lengths.


It is conjectured that, regardless of the x_0 chosen, the sequence will always converge to 1. That is,
the length is always finite. However, this has not been proven to date and remains an open question,
known as the Collatz conjecture. In addition, a counter-example has not yet been computationally
found. That is, there is no known x_0 for which the sequence doesn’t eventually go down to 1.


Now that the context of the problem is set, we create a histogram of lengths of hailstone sequences
based on different values of x_0. Our approach is shown in Listing 1.11 where we first create a
function which calculates the length of a hailstone sequence based on a chosen value of x_0. We then
use a comprehension to evaluate this function for each value, x_0 = 2, 3, ..., 10^7, and finally plot a
histogram of these lengths, shown in Figure 1.9.
Figure 1.9: Histogram of hailstone sequence lengths.


Listing 1.11: Histogram of hailstone sequence lengths


using Plots; pyplot()

function hailLength(x::Int)
    n = 0
    while x != 1
        if x % 2 == 0
            x = Int(x/2)
        else
            x = 3x +1
        end
        n += 1
    end
    return n
end

lengths = [hailLength(x0) for x0 in 2:10^7]


histogram(lengths, bins=1000, normed=:true,
    fill=(:blue, true), la=0, legend=:none,
    xlims=(0, 500), ylims=(0, 0.012),
    xlabel="Length", ylabel="Frequency")











In lines 3-14 the function hailLength() is created, which evaluates the length of a hailstone
sequence, n, given the first number in the sequence, x. Note the use of ::Int, which indicates the
method implemented operates only on integer types. A while loop is used to sequentially and
repeatedly evaluate all code contained within it, until the specified condition is false. In this case
until we obtain a hailstone number of 1. Note the use of the not-equals comparison operator, !=. In
line 6 the modulo operator, %, and equality operator, ==, are used in conjunction to check if the
current number is even. If true, then we proceed to line 7, else we proceed to line 9. In line 11 our
hailstone sequence length is increased by one each time we generate a new number in our sequence.
In line 13 the length of the sequence is returned. In line 16 a comprehension is used to evaluate our
function for integer values of x_0 between 2 and 10^7. In lines 18-21 the histogram() function is used
to plot a histogram using an arbitrary bin count of 1000.
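The length function can be tried on its own, without plotting, to confirm the example sequence. The following is a compact sketch of the same idea; it uses integer division (÷) and the ternary operator, which is a stylistic variation assumed here rather than the form described above.

```julia
# Sketch of a hailstone-length function using integer division.
function hailLength(x::Int)
    n = 0
    while x != 1
        x = (x % 2 == 0) ? x ÷ 2 : 3x + 1
        n += 1
    end
    return n
end

println(hailLength(3))                    # steps for 3,10,5,16,8,4,2,1
println(maximum(hailLength.(2:1000)))     # longest length for x0 up to 1000
```

Integer division keeps x an integer throughout, so no conversion back from a floating point value is needed.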
Creating Animations


We now present an example of a live animation which sequentially draws the edges of a fully-
connected mathematical graph. A graph is an object that consists of vertices, represented by dots,
and edges, represented by lines connecting the vertices.


In this example we construct a series of equally spaced vertices around the unit circle, given an
integer number of vertices, n. To add another aspect to this example, we obtain the points around
the unit circle by considering the complex numbers,


z_k = e^{2 \pi i k / n} for k = 1, ..., n. (1.4)


We then use the real and imaginary parts of z_k to obtain the horizontal and vertical coordinates
for each vertex respectively, which distributes n points evenly on the unit circle. The example in
Listing 1.12 sequentially draws all possible edges connecting each vertex to all remaining vertices,
and animates the process. Each time an edge is created, a frame snapshot of the figure is saved,
and by quickly cycling through the frames generated, we can generate an animated GIF. A single
frame approximately half way through the GIF animation is shown in Figure 1.10.


Listing 1.12: Animated edges of a graph


using Plots; pyplot()

function graphCreator(n::Int)
    vertices = 1:n
    complexPts = [exp(2*pi*im*k/n) for k in vertices]
    coords = [(real(p),imag(p)) for p in complexPts]
    xPts = first.(coords)
    yPts = last.(coords)
    edges = []
    for v in vertices, u in (v+1):n
        push!(edges,(v,u))
    end

    anim = Animation()
    scatter(xPts, yPts, c=:blue, msw=0, ratio=1,
        xlims=(-1.5,1.5), ylims=(-1.5,1.5), legend=:none)

    for i in 1:length(edges)
        u, v = edges[i][1], edges[i][2]
        xpoints = [xPts[u], xPts[v]]
        ypoints = [yPts[u], yPts[v]]
        plot!(xpoints, ypoints, line=(:red))
        frame(anim)
    end

    gif(anim, "graph.gif", fps = 60)
end

graphCreator(16)
Figure 1.10: Sample frame from a graph animation.


The code defines the function graphCreator(), which constructs the animated GIF based on
n vertices. In line 5 the complex points calculated via (1.4) are stored in the array
complexPts. In line 6 real() and imag() extract the real and imaginary parts of each
complex number respectively, and store them as paired tuples. In lines 7-8, the x and y coordinates are
retrieved via first() and last() respectively. Note lines 5-8 could be shortened and implemented
in various other ways, however the current implementation is useful for demonstrating several aspects
of the language. Then lines 10-12 loop over v and u, and in line 11 the tuple (v, u) is added to edges.
In line 14 an Animation() object is created. The vertices are plotted in lines 15-16 via scatter().
The loop in lines 18-24 plots a line for each of the edges via plot!(). Then frame(anim) adds
the current figure as another frame to the animation object. The gif() function in line 26 saves the
animation as the file graph.gif where fps defines how many frames per second are rendered.
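The geometric part of this example can be checked without any plotting package. The following is a minimal sketch that computes the points of (1.4) and the edge list for a small n; the choice n = 6 and the use of a nested comprehension in place of the loop with push!() are assumptions made for brevity.

```julia
# Sketch: vertices on the unit circle and the edge list of a complete graph.
n = 6
complexPts = [exp(2*pi*im*k/n) for k in 1:n]
xPts, yPts = real.(complexPts), imag.(complexPts)
edges = [(v, u) for v in 1:n for u in (v+1):n]   # each pair appears once

println("All on unit circle: ", all(isapprox.(abs.(complexPts), 1.0)))
println("Number of edges:    ", length(edges), " = n(n-1)/2 = ", n*(n-1)÷2)
```

Since every pair of distinct vertices is joined exactly once, a graph with n vertices yields n(n-1)/2 edges, and hence that many frames in the animation.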





Raster Images 


We now present an example of working with raster images, namely images composed of individual
pixels. In Listing 1.13 we load a sample image of stars in space and locate the brightest star. Note
that the image contains some amount of noise. In particular, as seen from the output, the single
brightest pixel is located at [192,168], given as (row, column). This pixel is due to noise, so if we
tried to locate the brightest star by a single pixel’s intensity, we would not identify the correct
coordinates.


Since looking at single pixels can be deceiving, to find the highest intensity star, we use a simple
method of passing a kernel over the image. This technique smooths the image and eliminates some
of the noise. The results are in Figure 1.11 where the two subplots show the original image vs. the
smoothed image, and the location of the brightest star for each.
Figure 1.11: Left: Original image.
Right: Smoothed image after noise removal.


Listing 1.13: Working with images


using Plots, Images; pyplot()

img = load("../data/stars.png")
gImg = red.(img)*0.299 + green.(img)*0.587 + blue.(img)*0.114
rows, cols = size(img)

println("Highest intensity pixel: ", findmax(gImg))

function boxBlur(image,x,y,d)
    if x<=d || y<=d || x>=cols-d || y>=rows-d
        return image[x,y]
    else
        total = 0.0
        for xi = x-d:x+d
            for yi = y-d:y+d
                total += image[xi,yi]
            end
        end
        return total/((2d+1)^2)
    end
end

blurImg = [boxBlur(gImg,x,y,5) for x in 1:cols, y in 1:rows]

yOriginal, xOriginal = argmax(gImg).I
yBoxBlur, xBoxBlur = argmax(blurImg).I

p1 = heatmap(gImg, c=:Greys, yflip=true)
p1 = scatter!((xOriginal, yOriginal), ms=60, ma=0, msw=4, msc=:red)
p2 = heatmap(blurImg, c=:Greys, yflip=true)
p2 = scatter!((xBoxBlur, yBoxBlur), ms=60, ma=0, msw=4, msc=:red)

plot(p1, p2, size=(800, 400), ratio=:equal, xlims=(0,cols), ylims=(0,rows),
    colorbar_entry=false, border=:none, legend=:none)








Highest intensity pixel: (0.9999999999999999, CartesianIndex(192, 168)) 
In line 3 the image is read into memory via the load() function and stored as img. Since the image
is 400 x 400 pixels, it is stored as a 400 x 400 array of RGBA tuples of length 4. Each element of these
tuples represents one of the color layers in the following order: red, green, blue, and alpha (opacity). In
line 4 we create a grayscale image from the original image data via a linear combination of its RGB
layers. This choice of coefficients is a common “Grayscale algorithm”. The gray image is stored as the
matrix gImg. In line 5 the size() function is used to determine the number of rows and columns
of the image, which are then stored as rows and cols respectively. In line 7 findmax() is used to find
the highest intensity element (pixel) in gImg. It returns a tuple of value and index, where in this case
the index is of type CartesianIndex because gImg is a two dimensional array (matrix). In lines
9-21 the function boxBlur is created. This function takes an array of values as input, representing
an image, and then passes a kernel over the image data, taking a linear average in the process. This is
known as “box blur”. In other words, at each pixel, the function returns a single pixel with a brightness
weighting based on the average of the surrounding pixels (or array values) in a given neighborhood
within a box of side length 2d + 1. Note that the edges of the image are not smoothed, as a border
of un-smoothed pixels of ‘depth’ d exists around the image's edges. Visually, this kernel smoothing
method has the effect of blurring the image. In line 23, the function boxBlur() is applied over the
image for a value of d = 5, i.e. an 11 x 11 kernel. The smoothed data is then stored as blurImg.
In lines 25-26 we use the argmax() function which is similar to findmax(), but only returns the
index. We use it to find the index of the pixel with the largest value, for both the non-smoothed and
smoothed image data. Note the use of the trailing .I at the end of each argmax(), which extracts
the Tuple of values of the co-ordinates from the CartesianIndex type. As the Cartesian index
of a matrix is given in (row, column) order, we reverse the row and column order for the plotting
that follows. The remaining lines create Figure 1.11.
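The effect of smoothing on the location of the maximum can be seen on a small synthetic array, without loading an image file. The following is a minimal sketch; the 9 x 9 array, its values and the window half-width d = 1 are made up for illustration, and this variant of boxBlur() takes (row, column) indices and reads the dimensions from the array itself rather than from global variables.

```julia
# Sketch: box blur on a small synthetic "image" (values are made up).
function boxBlur(image, r, c, d)
    rows, cols = size(image)
    # Leave a border of depth d unsmoothed
    if r <= d || c <= d || r > rows - d || c > cols - d
        return image[r, c]
    end
    return sum(image[r-d:r+d, c-d:c+d]) / (2d + 1)^2
end

img = zeros(9, 9)
img[2, 2] = 1.0                 # a single bright noise pixel
img[5:7, 5:7] .= 0.8            # a dimmer but larger "star"

d = 1
blurImg = [boxBlur(img, r, c, d) for r in 1:9, c in 1:9]
println("Brightest raw pixel:      ", argmax(img).I)
println("Brightest smoothed pixel: ", argmax(blurImg).I)
```

The isolated noise pixel is averaged down to 1/9 of its value, while the center of the 3 x 3 star keeps its intensity, so the maximum moves from the noise pixel to the star.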





1.5 Random Numbers and Monte Carlo Simulation 


Many of the code examples in this book make use of pseudorandom number generation, often
coupled with the so-called Monte Carlo simulation method for obtaining numerical estimates. The
phrase “Monte Carlo” associated with random number generation comes from Monte Carlo, the
district of Monaco famous for its many casinos. We now overview the core ideas and principles of
random number generation and Monte Carlo simulation.


The main player in this discussion is the rand() function. When used without input arguments,
rand() generates a “random” number in the interval [0,1]. Several questions can be asked. How
is it random? What does random within the interval [0,1] really mean? How can it be used as an
aid for statistical and scientific computation? For this we discuss pseudorandom numbers in a bit
more generality.


The “random” numbers we generate using Julia, as well as most “random” numbers used in any
other scientific computing platform, are actually pseudorandom. That is, they aren't really random
but rather appear random. For their generation, there is some deterministic (non-random and well
defined) sequence, {x_n}, specified by


x_{n+1} = f(x_n, x_{n-1}, \ldots), (1.5)


originating from some specified seed, x_0. The mathematical function, f(·), is often (but not always)
quite a complicated function, designed to yield desirable properties for the sequence {x_n} that make
it appear random. Among other properties we wish for the following to hold:
(i) Elements $x_i$ and $x_j$ for $i \neq j$ should appear statistically independent. That is, knowing the
value of $x_i$ should not yield information about the value of $x_j$.


(ii) The distribution of $\{x_n\}$ should appear uniform. That is, there shouldn't be values (or ranges
of values) where elements of $\{x_n\}$ occur more frequently than others.


(iii) The range covered by $\{x_n\}$ should be well defined.


(iv) The sequence should repeat itself as rarely as possible.


Typically, a mathematical function such as $f(\cdot)$ is designed to produce integers in the range
$\{0, \ldots, 2^{\ell} - 1\}$ where $\ell$ is typically 16, 32, 64 or 128 (depending on the number of bits used to
represent an integer). Hence $\{x_n\}$ is a sequence of pseudorandom integers. Then if we wish to have
a pseudorandom number in the range [0, 1] (represented via a floating point number), we normalize
via,

$$U_n = \frac{x_n}{2^{\ell} - 1}.$$

When calling rand() in Julia (as well as in many other programming languages), what we are
doing is effectively requesting the system to present us with $U_n$. Then, in the next call, $U_{n+1}$, and
in the call after this $U_{n+2}$ etc. As a user, we don't care about the actual value of $n$, we simply trust
the computing system that the next pseudorandom number will differ and adhere to the properties
(i) - (iv) mentioned above.
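The normalization step can be sketched in a few lines. This is only an illustration of the formula above, under the assumption that $\ell = 32$ and that raw integers are obtained via rand(UInt32); it is not a description of how Julia's rand() is implemented internally. The seed value and the variable names are arbitrary.

```julia
using Random
Random.seed!(1974)            # arbitrary seed, chosen only for reproducibility

ell = 32                      # assumed number of bits of each raw integer
x = rand(UInt32, 5)           # pseudorandom integers in {0, ..., 2^32 - 1}
u = x ./ (2^ell - 1)          # normalized floating point values in [0, 1]

println(x)
println(u)
```

Each entry of u is the corresponding entry of x divided by the largest possible 32-bit value, so the results are floating point numbers between 0 and 1.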


One may ask, where does the sequence start? For this we have a special name that we call
$x_0$. It is known as the seed of the pseudorandom sequence. Typically, as a scientific computing
system starts up, it sets $x_0$ to be the current time. This implies that on different system startups,
$x_0, x_1, x_2, \ldots$ will be different sequences of pseudorandom numbers. However, we may also set the
seed ourselves. There are several uses for this and it is often useful for reproducibility of results.
Listing 1.14 illustrates setting the seed using Julia's Random.seed!() function.


Listing 1.14: Pseudorandom number generation

    using Random

    Random.seed!(1974)
    println("Seed 1974: ", rand(), "\t", rand(), "\t", rand())
    Random.seed!(1975)
    println("Seed 1975: ", rand(), "\t", rand(), "\t", rand())
    Random.seed!(1974)
    println("Seed 1974: ", rand(), "\t", rand(), "\t", rand())

Seed 1974: 0.21334106865797864 0.12757925830167505 0.5047074487066832
Seed 1975: 0.7672833719737708 0.8664265778687816 0.5807364110163316
Seed 1974: 0.21334106865797864 0.12757925830167505 0.5047074487066832


As can be seen from the output, setting the seed to 1974 produces the same sequence. However, 
setting the seed to 1975 produces a completely different sequence. 


One may ask why use random or pseudorandom numbers? Sometimes having arbitrary numbers
alleviates programming tasks or helps randomize behavior. For example, when designing computer
video games, having enemies appear at random spots on the screen makes for a simple
implementation. In the context of scientific computing and statistics, the answer lies in the Monte Carlo
simulation method. Here the idea is that computations can be aided by repeated sampling and
averaging out the result. Many of the code examples in our book do this and we illustrate one such
simple example below.

Figure 1.12: Estimating $\pi$ via Monte Carlo.


Monte Carlo Simulation 


As an example of Monte Carlo, say we wish to estimate the value of $\pi$. There are hundreds
of known numerical methods to do this and here we explore one. Observe that the area of one
quarter section of the unit circle is $\pi/4$. Now if we generate random points, $(x, y)$, within a unit
box, $[0, 1] \times [0, 1]$, and calculate the proportion of total points that fall within the quarter circle, we
can approximate $\pi$ via,

$$\hat{\pi} = 4 \, \frac{\text{Number of points with } x^2 + y^2 \le 1}{\text{Total number of points}}.$$

This is performed in Listing 1.15 for $10^5$ points. The listing also creates Figure 1.12.


Listing 1.15: Estimating $\pi$

     1  using Random, LinearAlgebra, Plots; pyplot()
     2  Random.seed!()
     3
     4  N = 10^5
     5  data = [[rand(),rand()] for _ in 1:N]
     6  indata = filter((x)-> (norm(x) <= 1), data)
     7  outdata = filter((x)-> (norm(x) > 1), data)
     8  piApprox = 4*length(indata)/N
     9  println("Pi Estimate: ", piApprox)
    10
    11  scatter(first.(indata),last.(indata), c=:blue, ms=1, msw=0)
    12  scatter!(first.(outdata),last.(outdata), c=:red, ms=1, msw=0,
    13      xlims=(0,1), ylims=(0,1), legend=:none, ratio=:equal)

Pi Estimate: 3.14068










In Line 2 the seed of the random number generator is set with Random.seed!(). Note that obtaining
the same estimate each time the code is run requires passing an explicit integer seed, as in
Listing 1.14; when called without an argument, Random.seed!() reseeds the generator with a freshly
generated seed. In Line 4, the number of repetitions, N, is set. Most code examples in this book use N
as the number of repetitions in a Monte Carlo simulation. Line 5 generates an array of arrays. That is,
the pair, [rand(),rand()] is an array of random coordinates in $[0, 1] \times [0, 1]$. Line 6 filters those
points to use for the numerator of $\hat{\pi}$. It uses the filter() function, where the first argument is an
anonymous function, (x) -> (norm(x) <= 1). Here, norm() defaults to the $L_2$ norm, i.e.
$\sqrt{x^2 + y^2}$. The resulting indata array only contains the points that fall within the unit circle (with
each represented as an array of length 2). Line 7 creates the analogous outdata array. It is not used
for the estimation, but is used in plotting. Line 8 calculates the approximation, with length() used for
the numerator of $\hat{\pi}$ and N for the denominator. Lines 11-13 are used to create Figure 1.12.
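To see how the quality of the estimate depends on the number of points, the same idea can be wrapped in a small function and run for several sample sizes. The following is a minimal sketch without plotting; the helper name estimatePi, the seed 1974 and the chosen sample sizes are illustrative assumptions and not part of the listing above.

```julia
using Random
Random.seed!(1974)     # arbitrary explicit seed, so that reruns give the same output

# Estimate pi from n uniform points in the unit square (hypothetical helper).
function estimatePi(n)
    hits = count(_ -> rand()^2 + rand()^2 <= 1, 1:n)   # points inside the quarter circle
    return 4*hits/n
end

for n in [10^2, 10^4, 10^6]
    est = estimatePi(n)
    println("n = ", n, "\testimate = ", est, "\tabsolute error = ", abs(est - pi))
end
```

The absolute error typically shrinks at a rate proportional to $1/\sqrt{n}$, so each additional digit of accuracy requires roughly 100 times more points.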





Inside a Simple Pseudorandom Number Generator 


Number theory and related fields play a central role in the mathematical study of pseudorandom
number generation, the internals of which are determined by the specifics of $f(\cdot)$ of (1.5). However,
typically this is not of direct interest to statisticians. Nevertheless, for exploratory purposes we
illustrate how one can make a simple pseudorandom number generator.


A simple to implement class of pseudorandom number generators is the class of Linear
Congruential Generators (LCG). LCGs are common in older systems. Here the function
$f(\cdot)$ is nothing but an affine (linear) transformation modulo $m$,

$$x_{n+1} = (a \, x_n + c) \bmod m. \tag{1.6}$$

The integer parameters $a$, $c$ and $m$ are fixed and specify the details of the LCG. Some number
theory research has determined “good” values of $a$ and $c$ for specific values of $m$. For example, for
$m = 2^{32}$, setting $a = 69069$ and $c = 1$ yields sensible performance (other possibilities work well, but
not all). In Listing 1.16 we generate values based on this LCG, see also Figure 1.13.


Listing 1.16: A linear congruential generator

     1  using Plots, LaTeXStrings, Measures; pyplot()
     2
     3  a, c, m = 69069, 1, 2^32
     4  next(z) = (a*z + c) % m
     5
     6  N = 10^6
     7  data = Array{Float64,1}(undef, N)
     8
     9  x = 808
    10  for i in 1:N
    11      data[i] = x/m
    12      global x = next(x)
    13  end
    14
    15  p1 = scatter(1:1000, data[1:1000],
    16      c=:blue, m=4, msw=0, xlabel=L"n", ylabel=L"x_n")
    17  p2 = histogram(data, bins=50, normed=:true,
    18      ylims=(0,1.1), xlabel="Support", ylabel="Density")
    19  plot(p1, p2, size=(800, 400), legend=:none, margin = 5mm)
Figure 1.13: Left: The first 1,000 values generated by a linear congruential
generator, plotted sequentially. Right: A histogram of $10^6$ random values.





In line 4, (1.6) is implemented as the function next(). In line 7 an array of Float64 of length N
is preallocated. In line 9 the seed is arbitrarily set as the value 808. In lines 10-13 a loop is used N
times. In line 11 the current value of x is divided by m to obtain a number in the range [0, 1). Note
that in Julia division of two integers results in a floating point number. In line 12, (1.6) is applied
recursively via next() to set a new value for x. In lines 15-16 a scatterplot of the first 1000 values of
data is created, while lines 17-18 create a histogram of all values of data with 50 bins. As expected
by the theory of LCG, the values appear uniformly distributed.
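Property (iv), that the sequence should repeat itself as rarely as possible, can be explored on a toy scale. The sketch below counts how many steps an LCG of the form (1.6) takes to return to its seed. The function names lcgNext and lcgPeriod and the small parameter values are illustrative assumptions chosen so that the full cycle can be enumerated; they are not parameters recommended for practical use.

```julia
# One step of a linear congruential generator, as in (1.6).
lcgNext(x, a, c, m) = (a*x + c) % m

# Number of steps until the sequence first returns to its seed,
# or `nothing` if that does not happen within m steps.
function lcgPeriod(a, c, m, seed)
    x = seed
    for n in 1:m
        x = lcgNext(x, a, c, m)
        x == seed && return n
    end
    return nothing
end

println(lcgPeriod(5, 3, 16, 0))   # 16: every value in 0:15 is visited once
println(lcgPeriod(3, 0, 16, 1))   # 4: a poor choice, the cycle is 1, 3, 9, 11
```

With $m = 16$ the longest possible period is 16, which the first parameter choice attains while the second does not. This illustrates why not all values of $a$ and $c$ work well for a given $m$.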





More About Julia’s rand()


Having covered the basics, we now describe a few more aspects of Julia’s random number
generation. The key function at play is rand(). However, as you already know, a Julia function
may be implemented by different methods. The rand() function is no different. To see this,
key in methods(rand) and you'll see dozens of different methods of rand(). Furthermore, if
you do this after loading the Distributions package into the namespace (by running using
Distributions) that number will grow substantially. Hence in short, there are many ways to use
the rand() function in Julia. Throughout the rest of this book we use it in various ways, including
in conjunction with probability distributions. However we now focus on functionality from the Base
package.


There are other functions related to rand(), such as randn() for generating normally
distributed random variables. Also after invoking using Random, the following functions are
available: Random.seed!(), randsubseq(), randstring(), randcycle(), bitrand(), as well
as randperm() and shuffle() for permutations. There is also the MersenneTwister()
constructor among others. These are discussed in the Julia documentation. You may also use the
built-in help to enquire about them. We now focus on the MersenneTwister() constructor and
explain how it can be used in conjunction with rand() and variants.
Figure 1.14: Random walks with slightly different parameters.
Left: Trajectories with same seed. Right: Different seed per trajectory.


The term Mersenne Twister refers to a type of pseudorandom number generator. It is an 
algorithm that is considerably more complicated than the LCG described above. Generally, its 
statistical properties are much better than those of LCG. Due to this it has made its way into most 
scientific programming environments in the past two decades. Julia has adopted it as the standard 
as well. 


Our interest in mentioning the Mersenne Twister is due to the fact that in Julia we can create an
object representing a random number generator implemented via this algorithm. To create such an
object we write for example rng = MersenneTwister(seed), where seed is some initial seed
value. Then the object rng acts as a random number generator, and may serve as an additional
input to rand() and related functions. For example, calling rand(rng) uses the specific random
number generator object passed to it. In addition to MersenneTwister(), there are also other
ways to create similar objects, such as for example RandomDevice(). However we leave it to the
reader to investigate these via the online help.
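As a small illustration of such generator objects, the sketch below creates two MersenneTwister objects from the same seed and draws from each. The seed value 27 and the variable names are arbitrary choices made for this example.

```julia
using Random

rngA = MersenneTwister(27)     # two independent generator objects,
rngB = MersenneTwister(27)     # both started from the same (arbitrary) seed

a = [rand(rngA) for _ in 1:3]  # first three values from rngA
b = [rand(rngB) for _ in 1:3]  # first three values from rngB
c = [rand(rngA) for _ in 1:3]  # next three values from rngA

println(a == b)    # true: same seed, same position in the sequence
println(a == c)    # the later draws continue the stream and differ from the first three
```

Each object keeps its own position in its own sequence, so drawing from rngA does not affect what rngB produces next.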


By creating random number generator objects, you may have more than one random sequence 
in your application, essentially operating simultaneously. In Chapter we investigate scenarios 
where this is advantageous from a Monte Carlo simulation perspective. For now we show how a 
random number generator may be passed into a function as an argument, allowing the function to 
generate random values using that specific generator. 


Listing 1.17 creates random paths in the plane. Each path starts at $(x, y) = (0, 0)$ and moves
right, up, left or down at each step. The movements right (x+=1) and up (y+=1) are with steps of
size 1. However the movements left and down are with steps that are uniformly distributed in the
range $[0, 2+\alpha]$. Hence if $\alpha > 0$, on average the path drifts in the down-left direction. The virtue of
this initial example is that by using common random numbers and simulating paths for varying $\alpha$,
we get very different behavior than if we use a different set of random numbers for each path. See
Figure 1.14. We discuss more advanced applications of using multiple random number generators
in a later chapter. However, we implicitly use this Monte Carlo technique throughout the book, often
by setting the seed to a specific value in the code examples.
Listing 1.17: Random walks and seeds

     1  using Plots, Random, Measures; pyplot()
     2
     3  function path(rng, alpha, n=5000)
     4      x, y = 0.0, 0.0
     5      xDat, yDat = [], []
     6      for _ in 1:n
     7          flip = rand(rng,1:4)
     8          if flip == 1
     9              x += 1
    10          elseif flip == 2
    11              y += 1
    12          elseif flip == 3
    13              x -= (2+alpha)*rand(rng)
    14          elseif flip == 4
    15              y -= (2+alpha)*rand(rng)
    16          end
    17          push!(xDat, x)
    18          push!(yDat, y)
    19      end
    20      return xDat, yDat
    21  end
    22
    23  alphaRange = [0.2, 0.21, 0.22]
    24
    25  default(xlabel = "x", ylabel = "y", xlims=(-150,50), ylims=(-250,50))
    26  p1 = plot(path(MersenneTwister(27), alphaRange[1]), c=:blue)
    27  p1 = plot!(path(MersenneTwister(27), alphaRange[2]), c=:red)
    28  p1 = plot!(path(MersenneTwister(27), alphaRange[3]), c=:green)
    29
    30  rng = MersenneTwister(27)
    31  p2 = plot(path(rng, alphaRange[1]), c=:blue)
    32  p2 = plot!(path(rng, alphaRange[2]), c=:red)
    33  p2 = plot!(path(rng, alphaRange[3]), c=:green)
    34  plot(p1, p2, size=(800, 400), legend=:none, margin=5mm)





Lines 3-21 define the function path(). As a first argument it takes a random number generator, rng.
That is, the function is designed to receive an object such as MersenneTwister as an argument.
The second argument is alpha and the third argument is the number of steps in the path with a
default value of 5000. In lines 6-19 we loop n times, each time updating the current coordinate (x
and y) and then pushing the values into the arrays, xDat and yDat. Line 7 generates a random
value in the range 1:4. Observe the use of rng as a first argument to rand(). In lines 13 and
15 we multiply rand(rng) by (2+alpha). This creates uniform random variables in the range
$[0, 2+\alpha]$. Line 20 returns a tuple of two arrays xDat, yDat. After setting alphaRange in line 23
and setting default plotting arguments in line 25, we create and plot paths with common random
numbers in lines 26-28. This is because in each call to path() we pass a newly created
MersenneTwister() object with the same seed. Here 27 is just an arbitrary starting seed. In contrast,
lines 31-33 have repeated calls to path() using a single stream, rng, created in line 30. Hence here, we
don't have common random numbers because each subsequent call to path() starts at a fresh point
in the stream of rng.
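The effect of common random numbers can also be seen numerically, without plotting. The following sketch is a stripped-down variant of path() that returns only the final x-coordinate; the helper name finalX is a hypothetical name introduced for this illustration, and the seed 27 and the values of alpha mirror those of the listing.

```julia
using Random

# Final x-coordinate of a walk following the same rules as path() (hypothetical helper).
function finalX(rng, alpha, n=5000)
    x, y = 0.0, 0.0
    for _ in 1:n
        flip = rand(rng, 1:4)
        if flip == 1
            x += 1
        elseif flip == 2
            y += 1
        elseif flip == 3
            x -= (2+alpha)*rand(rng)
        else
            y -= (2+alpha)*rand(rng)
        end
    end
    return x
end

alphas = [0.2, 0.21, 0.22]
crn = [finalX(MersenneTwister(27), a) for a in alphas]   # same seed for every alpha
rng = MersenneTwister(27)
single = [finalX(rng, a) for a in alphas]                 # one shared stream

println("Common random numbers: ", crn)
println("Single stream:         ", single)
```

With common random numbers every walk makes the same sequence of moves and only the lengths of the left and down steps change, so the final x-coordinate decreases steadily as alpha grows. With a single shared stream, the random fluctuation between walks is generally much larger than the effect of such a small change in alpha.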
1.6 Integration with Other Languages


We now briefly overview how Julia can interface with the R-language, Python, and C. Note that 
there are several other packages that enable integration with other languages as well. 


Using and Calling R Packages 


R code, functions, and libraries can be called in Julia via the RCall package, which provides
several different ways of interfacing with R from Julia. When working with the REPL, one may
use $ to switch from the Julia REPL to an R REPL. However in this case variables are not
carried over between the two environments. The second way is via the @rput and @rget macros,
which transfer variables from Julia to the R environment and back, respectively. Finally, the R""" (or
@R_str) macro can also be used to parse R code contained within the string. This macro returns
an RObject as output, which is a Julia wrapper type around an R object.


We provide a brief example in Listing 1.18. It is related to Chapter 7 and focuses on the
statistical method of ANOVA (Analysis of Variance) covered in Section 7.3. The purpose here is to
demonstrate R-interoperability, rather than ANOVA itself. This example calculates the ANOVA
F-statistic and p-value, complementing Listing It makes use of the R aov() function and 
yields the same numerical results. 


Listing 1.18: Using R from Julia

     1  using CSV, DataFrames, RCall
     2
     3  data1 = CSV.read("../data/machine1.csv", header=false)[:,1]
     4  data2 = CSV.read("../data/machine2.csv", header=false)[:,1]
     5  data3 = CSV.read("../data/machine3.csv", header=false)[:,1]
     6
     7  function R_ANOVA(allData)
     8      data = vcat([ [x fill(i, length(x))] for (i, x) in
     9                      enumerate(allData) ]...)
    10      df = DataFrame(data, [:Diameter, :MachNo])
    11      @rput df
    12
    13      R"""
    14      df$MachNo <- as.factor(df$MachNo)
    15      anova <- summary(aov( Diameter ~ MachNo, data=df))
    16      fVal <- anova[[1]]["F value"][[1]][1]
    17      pVal <- anova[[1]]["Pr(>F)"][[1]][1]
    18      """
    19      println("R ANOVA f-value: ", @rget fVal)
    20      println("R ANOVA p-value: ", @rget pVal)
    21  end
    22
    23  R_ANOVA([data1, data2, data3])

R ANOVA f-value: 10.516968568709089
R ANOVA p-value: 0.00014236168817139574
In line 1 we specify usage of the required packages, including RCall. In lines 3-5 the data is loaded. In
lines 7-21 we create the Julia function R_ANOVA, which takes a Julia array of arrays as input, allData.
It outputs the summary results of an ANOVA test carried out in R via the aov() function. In lines
8-9 the array of arrays allData is re-arranged into a 2-dimensional array, where the first column
contains the observations from each of the arrays, and the second column contains the array index
from which each observation has come. The data is re-arranged like this due to the format that the
R aov() function requires. This re-arrangement is performed via the enumerate() function, along
with the vcat() function and splat ‘...’ operator. In line 10, the 2-dimensional array data is
converted to a DataFrame. Data frames are covered elsewhere in this book. In line 11 the @rput macro is
used to transfer the data frame df to the R workspace. In lines 13-18 a multi-line R code block is
executed inside the R""" macro. In line 14, the MachNo column of the R data frame df is defined as
a factor, i.e. a categorical column, via the R code as.factor() and <-. In line 15 an ANOVA test of
the Diameter column of the R data frame df is conducted via aov() and passed to the summary()
function, with the result stored as anova. In lines 16-17, the F-value and p-value are extracted from
anova. Lines 19 and 20 are back in Julia, where the output is printed. Note the use of @rget which
is used to copy the variables from R back to Julia using the same name.
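The re-arrangement carried out in lines 8-9 can be tried on its own, without R or any data files. The sketch below applies the same expression to a small made-up array of arrays; the numbers are placeholders for illustration and are not the machine data used in the listing.

```julia
# Three groups of observations of different lengths (made-up values).
allData = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

# For each group, place the observations next to a column holding the group index,
# then stack the resulting blocks vertically using the splat operator.
data = vcat([[x fill(i, length(x))] for (i, x) in enumerate(allData)]...)

println(size(data))     # (6, 2)
println(data[:, 1])     # the observations
println(data[:, 2])     # the group index of each observation
```

The first column holds all six observations in order and the second column holds the index of the array each one came from, which is the "long" layout that a grouped analysis expects.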





In addition to various R functions, users of R will most likely also be familiar with R Datasets. 
This is a collection of datasets commonly used in teaching and exploring statistics. You can read 
more about R Datasets at, 








Access to this collection of datasets from Julia is possible via the RDatasets package. Once
installed in Julia, datasets can be loaded by using the dataset() function and specifying an
‘R datasets package name’ followed by a ‘dataset name’. For example, dataset("datasets",
"mtcars") will load mtcars. Several code listings in this book use R datasets.


Using and Calling Python Packages 


It is possible to import Python modules and call Python functions directly in Julia via the 
PyCall package. It automatically converts types, and allows data structures to be shared between 
Python and Julia. By default, add PyCall uses the Conda package to install a minimal Python 
distribution that is private to Julia. Further Python packages can then be installed from within
Julia via the Julia Conda package. 


Alternatively, one can use a pre-existing Python installation on the system. In order to do
this, one must first set the PYTHON environment variable to the path of the Python executable, and
then re-build the PyCall package. For example, on a system with Anaconda installed, one would issue
commands similar to the below from within the Julia REPL:


] add PyCall 


ENV["PYTHON"] = "C:\\Program Files\\Anaconda3\\python.exe" 





] build PyCall 
We now provide a brief example which makes use of the TextBlob Python library, which provides
a simple API for conducting Natural Language Processing (NLP) tasks, including part-of-speech
tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. For our
example we use TextBlob to analyze the sentiment of several sentences. The sentiment analyzer of
TextBlob outputs a tuple of values, with the first value being the polarity of the sentence (a rating
of positive to negative), and the second value a rating of subjectivity (factual to subjective).


In order for Listing 1.19 to work, the TextBlob Python library must first be installed. The lines
below do this when executed in a shell or command prompt. Note that one can swap from the Julia
REPL to a shell via ‘;’.


pip3 install -U textblob 


python -m textblob.download_corpora 


Once Python and TextBlob are configured, Listing 1.19 can be executed. This example only
briefly touches on the PyCall package, with more information available in the package
documentation.


Listing 1.19: NLP via Python’s TextBlob

     1  using PyCall
     2  TB = pyimport("textblob")
     3
     4  str =
     5  """Some people think that Star Wars The Last Jedi is an excellent movie,
     6  with perfect, flawless storytelling and impeccable acting. Others
     7  think that it was an average movie, with a simple storyline and basic
     8  acting. However, the reality is almost everyone felt anger and
     9  disappointment with its forced acting and bad storytelling."""
    10
    11  blob = TB.TextBlob(str)
    12  [ i.sentiment for i in blob.sentences ]

(0.625, 0.636)
(-0.0375, 0.221)
(-0.46, 0.293)





In line 2 the pyimport() function is used to wrap the Python library textblob, which is then given
the Julia alias TB. In lines 4-9 the string str is created. For this example, the string is written as a first
hand account, and contains many words that give the text a negative tone. Note the use of multi-line
strings using """. In line 11 the TextBlob() function from TB is used to parse each sentence in str.
The output is stored as blob. This is where the call to Python is made. In line 12 a comprehension
is used to print the sentiment field for each sentence in blob. Note that sentiment is a Python
based field name accessible via Julia. As detailed in the TextBlob documentation, the sentiment of
the blob is an ordered pair of polarity and subjectivity, with polarity measured over $[-1.0, 1.0]$
(very negative to very positive), and subjectivity over $[0.0, 1.0]$ (very objective to very subjective).
The results indicate that the first sentence is the most positive but is also the most subjective, while
the last sentence is the most negative but also more objective.
Other Integrations


Julia also allows C and Fortran calls to be made directly via the ccall() function, which is in
Julia Base. These calls incur no more overhead than a standard library call
from C code. Note that the code to be called must be available as a shared library. For example,
in Windows systems, msvcrt can be called instead of libc (msvcrt is a module containing C
library functions, and is part of the Microsoft C Runtime Library).


When using the ccall() function, shared libraries are referenced in the format (:function,
"library"). The following is an example where the C function cos() is called,


ccall( (:cos, "msvcrt"), Float64, (Float64,), pi ).


For this example, the cos() function is called from the msvcrt library. Here, ccall() takes four
arguments, the first is the function and library as a tuple, the second is the return type, the third
is a tuple of input types (here there is just one), and the last is the input argument, $\pi$ in this case.
Running this in Julia on a Windows machine returns -1.0.
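For readers who are not on Windows, a similar call can be sketched using a function from the standard C library that is already loaded into the Julia process. The example below assumes a Linux or macOS system where the symbol strlen can be found without naming a library; this is an assumption about the platform, and the choice of strlen is ours for illustration.

```julia
# Call the C function strlen() on a Julia string.
# Csize_t is the Julia alias for the C type size_t and Cstring marks a
# NUL-terminated C string argument.
n = ccall(:strlen, Csize_t, (Cstring,), "Julia")

println(Int(n))    # the number of characters before the terminating NUL
```

Here only the function name is given as the first argument, since no library needs to be specified. The remaining arguments follow the same pattern as before: the return type, a tuple of input types, and then the inputs.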


There are also several other packages that support various other languages, such as
Cxx.jl and CxxWrap.jl for C++, MATLAB.jl for Matlab, and JavaCall.jl for Java. Note
that many of these packages are available from https://github.com/JuliaInterop





Chapter 2 


Basic Probability - DRAFT 


In this chapter we introduce elementary probability concepts. We describe key notions of a 
probability space along with independence and conditional probability. It is important to note 
that most of the probabilistic analysis carried out in statistics is based on distributions of random 
variables. These are introduced in the next chapter. In this chapter we focus solely on probability, 
events, and the simple mathematical set-up of a random experiment embodied in a probability 
space. 


The notion of probability is the chance of something happening, quantified as a number between
0 and 1 with higher values indicating a higher likelihood of occurrence. However, how do we
formally describe probabilities? The standard way to do this is to consider a probability space, which
mathematically consists of three elements: (1) A sample space - the set of all possible outcomes of
a certain experiment. (2) A collection of events - each event is a subset of the sample space. (3)
A probability measure, also called here a probability function - which indicates the chance of each
possible event occurring. Note: do not confuse this with a probability mass function, which we
define in the next chapter.


As a simple example, consider the case of flipping a coin twice. Recall that the sample space is
the set of all possible outcomes. We can represent the sample space mathematically as follows,

$$\Omega = \{hh, ht, th, tt\}.$$

Now that the sample space, $\Omega$, is defined, we can consider individual events. For example, let $A$ be
the event of getting at least one heads. Hence,

$$A = \{hh, ht, th\}.$$

Or alternately, let $B$ be the event of getting one heads and one tails in any order,

$$B = \{ht, th\}.$$

There can also be events that consist of a single possible outcome, for example $C = \{th\}$ is the
event of getting tails first, followed by heads. Mathematically, the important point is that events are
subsets of $\Omega$ and often contain more than one outcome. Possible events also include the empty set,
$\emptyset$ (nothing happening) and $\Omega$ itself (something happening). In the setup of probability, we assume
there is a random experiment where something is bound to happen.
The final component of a probability space is the probability function, also sometimes called
probability measure. This function, $P(\cdot)$, takes an event as an input argument and returns a real
number in the range [0, 1]. It always satisfies $P(\emptyset) = 0$ and $P(\Omega) = 1$. It also satisfies the fact that
the probability of the union of two disjoint events is the sum of their probabilities, and furthermore
the probability of the complement of an event is one minus the original probability.
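These properties can be checked on the coin flipping example using Julia sets. The sketch below assumes a fair coin, so that all four outcomes are equally likely; this assumption, the function name P, and the extra event D are introduced here only for illustration.

```julia
omega = Set(["hh", "ht", "th", "tt"])        # the sample space
P(event) = length(event) / length(omega)     # assumes equally likely outcomes

A = Set(["hh", "ht", "th"])    # at least one heads
C = Set(["th"])                # tails first, followed by heads
D = Set(["tt"])                # two tails (illustrative event, disjoint from C)

println(issubset(A, omega))                     # true: events are subsets of omega
println(P(Set{String}()), " ", P(omega))        # 0.0 1.0
println(P(union(C, D)) == P(C) + P(D))          # true: additivity for disjoint events
println(P(setdiff(omega, A)) == 1 - P(A))       # true: complement rule
```

The equality checks are exact here because all the probabilities involved are multiples of 1/4, which floating point numbers represent without rounding.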


This chapter is structured as follows: In Section we explore the basic setup of random 
experiments with a few examples. In Section we explore working with sets in Julia as well 
as probability examples dealing with unions of events. In Section we introduce and explore 
the concept of independence. In Section we move on to conditional probability. Finally, in 
Section [2.5] we explore Bayes’ rule for conditional probability. 


2.1 Random Experiments 


We now explore a few examples where we set up a probability space. In most examples we present
a Monte Carlo simulation of the random experiment, and then compare results to theoretical ones 
where possible. 


Rolling Two Dice 


Consider the random experiment where two independent, fair, six sided dice are rolled, and we
wish to find the probability that the sum of the outcomes of the dice is even. Here the sample space
can be represented as $\Omega = \{1, \ldots, 6\}^2$, i.e. the Cartesian product of the set of single roll outcomes
with itself. That is, elements of the sample space are tuples of the form $(i, j)$ with $i, j \in \{1, \ldots, 6\}$.
Say we are interested in the probability of the event,

$$A = \{(i, j) \mid i + j \text{ is even}\}.$$

In this random experiment, since the dice have no inherent bias, it is sensible to assume a symmetric
probability function. That is, for any $B \subset \Omega$,

$$P(B) = \frac{|B|}{|\Omega|},$$

where $|\cdot|$ counts the number of elements in the set. It is called symmetric because every outcome
in $\Omega$ has the same probability. Hence for our event, $A$, we can see from Table 2.1 that,

$$P(A) = \frac{18}{36} = 0.5.$$

Table 2.1: All possible outcomes for the sum of two dice. Even sums are shaded.


We now obtain this in Julia via both direct calculation and Monte Carlo simulation. A direct
calculation counts the number of outcomes with an even sum. A Monte Carlo simulation repeats the
experiment many times and estimates $P(A)$ based on the number of times that event $A$ occurred.


Listing 2.1: Even sum of two dice

    1  N, faces = 10^6, 1:6
    2
    3  numSol = sum([iseven(i+j) for i in faces, j in faces]) / length(faces)^2
    4  mcEst = sum([iseven(rand(faces) + rand(faces)) for i in 1:N]) / N
    5
    6  println("Numerical solution = $numSol \nMonte Carlo estimate = $mcEst")

Numerical solution = 0.5
Monte Carlo estimate = 0.499644





In line 1 we set the number of simulation runs, N, and the range of faces on the dice, 1:6. In line
3, we use a comprehension to cycle through the sum of all possible combinations of the addition of
the outcomes of the two dice. The outcomes of the two dice are represented by i and j respectively,
both of which take on the values of faces. We start with i=1, j=1 and add them, and we use the
iseven() function to return true if even, and false if not. We then repeat the process for i=1,
j=2 and so on, all the way to i=6, j=6. Finally, we count the number of true values by summing
all the elements of the comprehension via sum(). The result, normalized by the total number of
possible outputs, is stored in numSol. Line 4 also uses a comprehension, but in this case we uniformly
and randomly select the values which the dice take, akin to rolling them. Again iseven() is used
to return true if even and false if not, and we repeat this process N times. Using similar logic to
line 3, we store the proportion of outcomes which were true in mcEst. Line 6 prints the results using
the println() function. Notice the use of \n for creating a newline.
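The same two approaches apply to other events in this sample space. As a further illustration, the sketch below considers the event that the sum of the two dice equals 7, which is not discussed in the text above; the seed and the variable names are arbitrary choices.

```julia
using Random
Random.seed!(0)        # arbitrary explicit seed for reproducibility

N, faces = 10^6, 1:6

# Direct calculation: count the outcomes (i, j) with i + j == 7 among all 36.
exact = count(i + j == 7 for i in faces, j in faces) / length(faces)^2

# Monte Carlo: roll two dice N times and record how often the sum is 7.
mcEst = count(_ -> rand(faces) + rand(faces) == 7, 1:N) / N

println("Direct calculation = $exact \nMonte Carlo estimate = $mcEst")
```

There are six outcomes with a sum of 7, namely (1,6), (2,5), (3,4), (4,3), (5,2) and (6,1), so the direct calculation gives 6/36 = 1/6, and the Monte Carlo estimate should be close to this value.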





Partially Matching Passwords 


We now consider an alphanumeric example. Assume that a password to a secured system is
exactly 8 characters in length. Each character is one of 62 possible characters: the letters ‘a’ to ‘z’,
the letters ‘A’ to ‘Z’ or the digits ‘0’ to ‘9’.


In this example let $\Omega$ be the set of all possible passwords, i.e. $|\Omega| = 62^8$. Now, again assuming
a symmetric probability function, the probability of an attacker guessing the correct (arbitrary)
password is $62^{-8} \approx 4.6 \times 10^{-15}$. Hence at a first glance, the system seems very secure.


Elaborating on this example, let us also assume that as part of the system’s security
infrastructure, when a login is attempted with a password that matches 1 or more of the characters, an
event is logged in the system’s security portal (taking up hard drive space). For example, say the
original password is 3xyZu4vN, and a login is attempted using the password 35xyZ4vN. In this case 4 of
the characters match (those in positions 1, 6, 7 and 8) and therefore an event is logged.
While the chance of guessing a password and logging in seems astronomically low, in this simple
(fictional and overly simplistic) system, there exists a secondary security flaw. That is, hackers may
attempt to overload the event logging system via random attacks. If hackers continuously try to
log into the system with random passwords, every password that matches one or more characters
will log an event, thus taking up more hard-drive space.


We now ask what is the probability of logging an event with a random password? Denote the
event of logging a password by $A$. In this case, it turns out to be much more convenient to consider
the complement, $A^c := \Omega \setminus A$, which is the event of having 0 character matches. We have that
$|A^c| = 61^8$ because given any (arbitrary) correct password, there are $61 = 62 - 1$ character options
for each character, in order to ensure $A^c$ holds. Hence,

$$P(A^c) = \frac{61^8}{62^8} \approx 0.87802.$$


We then have that the probability of logging an event is P(A) = 1 — P(A*) ~ 0.12198. So if, for 
example, 10’ login attempts are made, we can expect that about 1.2 million login attempts would 
be written to the security log. We now simulate such a scenario in Listing [2.2] 


Listing 2.2: Password matching

using Random
Random.seed!()    # no argument: reseeds from system entropy; pass an integer for reproducible runs

passLength, numMatchesForLog = 8, 1
possibleChars = ['a':'z' ; 'A':'Z' ; '0':'9']    # the 62 valid characters

correctPassword = "3xyZu4vN"

numMatch(loginPassword) =
    sum([loginPassword[i] == correctPassword[i] for i in 1:passLength])

N = 10^7

passwords = [String(rand(possibleChars,passLength)) for _ in 1:N]
numLogs = sum([numMatch(p) >= numMatchesForLog for p in passwords])
println("Number of login attempts logged: ", numLogs)
println("Proportion of login attempts logged: ", numLogs/N)





Number of login attempts logged: 1221801
Proportion of login attempts logged: 0.1221801

In line 2 the random number generator is seeded. Note that Random.seed!() called without an
argument reseeds from system entropy, so the exact counts shown above vary between runs; to
generate the same passwords each time the code is run, pass an explicit integer seed, as in
Random.seed!(1). In line 4 the password length is defined
along with the minimum number of character matches before a security log entry is created. In line 5
an array is created, which contains all valid characters which can be used in the password. Note the
use of ‘;’, which performs array concatenation of the three ranges of characters. In line 7 we set an
arbitrary correct login password. Note that the type of correctPassword is a String containing
only characters from possibleChars. In lines 9 and 10 the function numMatch() is created, which
takes the password of a login attempt and checks each index against that of the actual password. If
the character at an index is correct, the comparison evaluates to true, else false. The function then
returns how many characters were correct by using sum(). Line 14 uses the function rand() and the
constructor String() along with a comprehension to randomly generate N passwords. Note that String()
is used to convert from an array of single characters to a string. Line 15 checks how many times
numMatchesForLog or more characters were guessed correctly, for each password in our array of
randomly generated passwords. It then stores how many times this occurs as the variable numLogs.
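As a quick check on the simulation, the analytic value of the logging probability can be evaluated directly. The following is a minimal sketch; the variable names numChars, pNoMatch and pLog are chosen here for illustration and do not appear in Listing 2.2.

```julia
# Exact probability that a uniformly random 8-character guess matches the
# correct password in at least one position (62 possible characters).
passLength, numChars = 8, 62

pNoMatch = ((numChars - 1) / numChars)^passLength   # P(A^c) = (61/62)^8
pLog = 1 - pNoMatch                                  # P(A)

N = 10^7
println("P(A^c) = ", round(pNoMatch, digits=5))
println("P(A)   = ", round(pLog, digits=5))
println("Expected number of logged attempts out of N: ", round(Int, N * pLog))
```

The simulated proportion in the listing output should be close to the value of P(A) printed here, with a discrepancy that shrinks as N grows.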





The Birthday Problem 


For our next example, consider a room full of people. We then ask what is the probability of 
finding a pair of people that share the same birthday. Obviously, ignoring leap years, if there are 
366 people present, then it happens with certainty via the pigeonhole principle. However, what if 
there are fewer people? Interestingly, with about 50 people, a birthday match is almost certain, and 
with 23 people in a room, there is about a 50% chance of two people sharing a birthday. At first 
glance this non-intuitive result is surprising, and hence this famous probability example earned the 
name the birthday paradox. However, we just refer to it as the birthday problem. 


To carry out the analysis, we assume birthdays are uniformly distributed in the set $\{1,\ldots,365\}$.
For $n$ people in a room, we wish to evaluate the probability that at least two people share the
same birthday. Set the sample space, $\Omega$, to be composed of ordered tuples $(x_1,\ldots,x_n)$ with
$x_i \in \{1,\ldots,365\}$. Hence, $|\Omega| = 365^n$. Now set the event $A$ to be the set of all tuples
$(x_1,\ldots,x_n)$ where $x_i = x_j$ for some distinct $i$ and $j$.

As in the previous example, we consider $A^c$ instead. It consists of tuples where $x_i \neq x_j$ for all
distinct $i$ and $j$ (the event of no birthday pair in the group). In this case,

$$|A^c| = 365 \cdot 364 \cdot \ldots \cdot (365 - n + 1) = \frac{365!}{(365-n)!}.$$

Hence we have,

$$P(A) = 1 - P(A^c) = 1 - \frac{365 \cdot 364 \cdot \ldots \cdot (365 - n + 1)}{365^n}. \qquad (2.1)$$


From this we can compute that for n = 23, P(A) = 0.5073, and for n = 50, P(A) = 0.9704. 
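These two values can be reproduced with a few lines of Julia. The sketch below evaluates the product form of the formula with a plain loop; the function name birthdayMatchProb is a name chosen for this illustration.

```julia
# Probability that at least two of n people share a birthday, assuming 365
# equally likely birthdays: a direct evaluation of 1 - P(A^c).
function birthdayMatchProb(n)
    pNoMatch = 1.0
    for k in 0:n-1
        pNoMatch *= (365 - k) / 365   # person k+1 avoids the k birthdays taken so far
    end
    return 1 - pNoMatch
end

println(round(birthdayMatchProb(23), digits=4))
println(round(birthdayMatchProb(50), digits=4))

# Smallest group size for which a match is at least as likely as not
smallestN = findfirst(n -> birthdayMatchProb(n) >= 0.5, 1:365)
println("Smallest n with P(A) >= 0.5: ", smallestN)
```

Multiplying the ratios one at a time keeps every intermediate value between 0 and 1, so no large factorials are ever formed.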


The code in Listing 2.3 calculates both the analytic probabilities, as well as estimates them
via Monte Carlo (MC) simulation. The results are presented in Figure 2.1. For the numerical
evaluation of (2.1), it employs two alternative implementations, matchExists1() and matchExists2().
The maximum error between the two numerical implementations is presented.

Listing 2.3: The birthday problem

using StatsBase, Combinatorics, Plots ; pyplot()

matchExists1(n) = 1 - prod([k/365 for k in 365:-1:365-n+1])
matchExists2(n) = 1 - factorial(365,365-big(n))/365^big(n)

function bdEvent(n)
    birthdays = rand(1:365,n)
    dayCounts = counts(birthdays, 1:365)
    return maximum(dayCounts) > 1
end

probEst(n) = sum([bdEvent(n) for _ in 1:N])/N

xGrid = 1:50
analyticSolution1 = [matchExists1(n) for n in xGrid]
analyticSolution2 = [matchExists2(n) for n in xGrid]
println("Maximum error: $(maximum(abs.(analyticSolution1 - analyticSolution2)))")

N = 10^3
mcEstimates = [probEst(n) for n in xGrid]

plot(xGrid, analyticSolution1, c=:blue, label="Analytic solution")
scatter!(xGrid, mcEstimates, c=:red, ms=6, msw=0, shape=:xcross,
    label="MC estimate", xlims=(0,50), ylims=(0, 1),
    xlabel="Number of people in room",
    ylabel="Probability of birthday match",
    legend=:topleft)














Maximum error: 2.4611723650627278208929385e-16 

Figure 2.1: Probability that in a room of n people,
at least two people share a birthday.

In lines 3 and 4, two alternative functions for calculating the probability in (2.1) are defined,
matchExists1() and matchExists2() respectively. The first uses the prod() function to apply
a product over a comprehension. This is in fact a numerically stable way of evaluating the probability.
The second implementation evaluates (2.1) in a much more explicit manner. It uses the factorial()
function from the Combinatorics package. Note that the basic factorial() function is included
in Julia Base, however the method with two arguments comes from the Combinatorics package.
Also, the use of big() ensures the input argument is a BigInt type. This is needed to avoid overflow
for non-small values of n. Lines 6-10 define the function bdEvent(), which simulates a room full of n
people, and if at least two people share a birthday, returns true, otherwise returns false. We now
explain how it works. Line 7 creates the array birthdays of length n, and uniformly and randomly
assigns an integer in the range [1, 365] to each index. The values of this array can be thought of as the
birth dates of individual people. Line 8 uses the function counts() from the StatsBase package
to count how many times each birth date occurs in birthdays, and assigns these counts to the new
array dayCounts. The logic can be thought of as follows: if two indices have the same value, then
this represents two people having the same birthday. Line 9 checks the array dayCounts, and if
the maximum value of the array is greater than one (i.e. if at least two people share the same birth
date) then returns true, else false. Line 12 defines the function probEst(), which, when given
n number of people, uses a comprehension to simulate N rooms, each containing n people. For each
element of the comprehension, i.e. room, the bdEvent() function is used to check if at least one
birthday pair exists. The number of rooms with at least one birthday pair is then summed
up and divided by the total number of rooms N. For large N, the function probEst() will be a good
estimate for the analytic solution of finding at least one birthday pair in a room of n people. Lines
14-17 evaluate the analytic solutions over the grid, xGrid, and print the maximal absolute error
between the solutions. The output shows that the numerical error is negligible. Line 20 evaluates the
Monte Carlo estimates. Lines 22-27 plot the analytic and numerical estimates of these probabilities
on the same graph.
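The event check in bdEvent() can also be written without StatsBase. The sketch below is one possible alternative, not the listing's method: it relies on allunique() from Julia Base, and the names bdEventAlt and probEstAlt are introduced here for illustration only.

```julia
using Random
Random.seed!(0)   # an arbitrary seed, used only to make this sketch repeatable

# A birthday match exists exactly when the sampled birthdays are not all distinct.
bdEventAlt(n) = !allunique(rand(1:365, n))

# Monte Carlo estimate over numRooms simulated rooms of n people each.
probEstAlt(n, numRooms) = sum(bdEventAlt(n) for _ in 1:numRooms) / numRooms

println("Estimate for n = 23: ", probEstAlt(23, 10^5))
println("Match with 366 people: ", bdEventAlt(366))
```

With 366 people the function always returns true, in line with the pigeonhole argument, and for n = 23 the estimate should land near the analytic value 0.5073.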























Sampling With and Without Replacement 


Consider a small pond with a small population of 7 fish, 3 of which are gold and 4 of which are
silver. Now say we fish from the pond until we catch 3 fish, either gold or silver. Let $G_n$ denote the
event of catching $n$ gold fish. It is clear that unless $n = 0, 1, 2$ or $3$, $P(G_n) = 0$. However, what is
$P(G_n)$ for $n = 0, 1, 2, 3$? Before continuing, let us make a distinction between two sampling policies:


Catch and keep - We sample from the population without replacement. That is, whenever we 
catch a fish, we remove it from the population. 


Catch and release - We sample from the population with replacement. That is, whenever we 
catch a fish, we return it to the population (pond) before continuing to fish. 


The computation of the probabilities $P(G_n)$ for these two cases of catch and keep, and catch and
release, may be obtained via the Hypergeometric distribution and Binomial distribution respectively.
These are both covered in more detail later in the book. We now estimate these probabilities using
Monte Carlo simulation. Listing 2.4 below simulates each policy N times, counts how many times
zero, one, two and three gold fish are sampled in total, and finally presents these as proportions
of the total number of simulations. Note that the total probability in both cases sums to one. The
probabilities are plotted in Figure 2.2.

Listing 2.4: Fishing with and without replacement

using StatsBase, Plots ; pyplot()

function proportionFished(gF,sF,n,N,withReplacement = false)
    function fishing()
        fishInPond = [ones(Int64,gF); zeros(Int64,sF)]
        fishCaught = Int64[]

        for fish in 1:n
            fished = rand(fishInPond)
            push!(fishCaught, fished)
            if withReplacement == false
                deleteat!(fishInPond, findfirst(x->x==fished, fishInPond))
            end
        end
        sum(fishCaught)
    end

    simulations = [fishing() for _ in 1:N]
    proportions = counts(simulations,0:n)/N

    if withReplacement
        plot!(0:n, proportions,
            line=:stem, marker=:circle, c=:blue, ms=6, msw=0,
            label="With replacement",
            xlabel="n",
            ylims=(0, 0.6), ylabel="Probability")
    else
        plot!(0:n, proportions,
            line=:stem, marker=:xcross, c=:red, ms=6, msw=0,
            label="Without replacement")
    end
end

N = 10^6
goldFish, silverFish, n = 3, 4, 3
plot()
proportionFished(goldFish, silverFish, n, N)
proportionFished(goldFish, silverFish, n, N, true)

Figure 2.2: Estimated probabilities of catching n gold fish,
with and without replacement.





Lines 3-32 define the function proportionFished(), which takes five arguments: the number of
gold fish in the pond gF, the number of silver fish in the pond sF, the number of times we catch a fish
n, the total number of simulation runs N, and a policy of whether we throw back (i.e. replace) each
caught fish, withReplacement, which is set to false by default. In lines 4-16 we create an inner
function fishing() that generates one random instance of a fishing day, returning the number of
gold fish caught. Line 5 generates an array, where the values in the array represent fish in the pond,
with 0’s and 1’s representing silver and gold fish respectively. Notice the use of the zeros() and
ones() functions, each with a first argument, Int64, indicating the Julia type. Line 6 initializes an
empty array, which represents the fish to be caught. Lines 8-14 perform the act of fishing n times via
the use of a for loop. Lines 9-10 randomly sample a “fish” from our “pond”, and then store this
value in our fishCaught array. Line 12 is only run if withReplacement is false, in which case we
“remove” the caught “fish” from the pond via the function deleteat!(). Note that technically we don't
remove the exact caught fish, but rather a fish with the same value (0 or 1) via findfirst(). Our use of
this function returns the first index in fishInPond with a value equalling fished. Line 15 is the
(implicit) return statement for the function fishing() and is the sum of how many gold fish were
caught (since gold fish are stored as 1’s and silver fish as 0’s). Line 18 implements our chosen policy N
times total, with the total number of gold fish each time stored in the array simulations. Line 19
uses the counts() function to return the proportion of times 0,...,n gold fish were caught. Lines 21-
31 then use plot!() to overlay the existing plot with the probabilities. The proportionFished()
function is then called twice in lines 37 and 38 to generate the resulting plot.
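The estimated proportions can be compared with exact values. The sketch below computes the Hypergeometric (catch and keep) and Binomial (catch and release) probabilities with rational arithmetic from Julia Base; the names pKeep, pRelease and p are chosen here for illustration.

```julia
# Exact probabilities of catching k gold fish in n = 3 catches from a pond
# with 3 gold and 4 silver fish.
gF, sF, n = 3, 4, 3

# Catch and keep (without replacement): hypergeometric probabilities
pKeep = [binomial(gF, k) * binomial(sF, n - k) // binomial(gF + sF, n) for k in 0:n]

# Catch and release (with replacement): binomial probabilities, p = gF/(gF+sF)
p = gF // (gF + sF)
pRelease = [binomial(n, k) * p^k * (1 - p)^(n - k) for k in 0:n]

println("Without replacement: ", pKeep)
println("With replacement:    ", pRelease)
println("Totals: ", sum(pKeep), ", ", sum(pRelease))
```

Both vectors sum to exactly one, and converting them with Float64.() gives values that the stems in Figure 2.2 should approximate.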

Lattice Paths


We now consider a square grid on which an ant walks from the south west corner to the north
east corner, taking either a step north or a step east at each grid intersection. This is illustrated in
Figure 2.3, where it is clear that there are many possible paths the ant could take. Let us set the
sample space to be,

$$\Omega = \text{All possible lattice paths},$$

where the term lattice path describes a trajectory of the ant going from the south west point, $(0,0)$,
to the north east point, $(n,n)$. Since $\Omega$ is finite, we can consider the number of elements in it,
denoted $|\Omega|$. For a general $n \times n$ grid,

$$|\Omega| = \binom{2n}{n} = \frac{(2n)!}{(n!)^2}.$$

For example if $n = 5$ then $|\Omega| = 252$. The use of the binomial coefficient here is because out of the
$2n$ steps that the ant needs to take, $n$ steps need to be ‘north’ and $n$ need to be ‘east’.





Within this context of lattice paths, there are a variety of questions. One common question has 
to do with the event (or set): 


$$A = \text{Lattice paths that stay on or above the diagonal the whole way from } (0,0) \text{ to } (n,n).$$

The set $A$ then describes all lattice paths where at any point, the ant has not taken more easterly
steps than northerly steps. The question of the size of $A$, namely $|A|$, has interested many people
in combinatorics, and it turns out that,

$$|A| = \frac{\binom{2n}{n}}{n+1}.$$

For each counting value of $n$, the above is called the $n$'th Catalan Number. For example, if $n = 1$
then $|A| = 1$, if $n = 2$ then $|A| = 2$ and if $n = 3$ then $|A| = 5$. You can try to sketch all possible paths
in $A$ for $n = 3$ (there are 5 in total).
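The Catalan numbers quoted above can be confirmed by counting paths directly. The following is a small sketch using plain recursion; the function names catalan and countUpperPaths are chosen here for illustration, with x counting east steps and y counting north steps.

```julia
# Closed form for the n'th Catalan number
catalan(n) = binomial(2n, n) ÷ (n + 1)

# Number of ways to continue from (x, y) to (n, n) using east/north steps
# without ever having more east steps than north steps.
function countUpperPaths(x, y, n)
    (x > y || y > n) && return 0        # below the diagonal, or off the grid
    (x == n && y == n) && return 1      # reached the north east corner
    return countUpperPaths(x + 1, y, n) + countUpperPaths(x, y + 1, n)
end

for n in 1:5
    println("n = ", n, ": recursion = ", countUpperPaths(0, 0, n),
            ", formula = ", catalan(n), ", all paths = ", binomial(2n, n))
end
```

The recursion is exponential in n and is only meant as a check for small grids; the closed form is what one would use in practice.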


So far we have discussed the sample space $\Omega$, and a potential event $A$. One interesting question
to ask deals with the probability of $A$. That is: What is the chance that the ant stays on or above
the diagonal as it journeys from $(0,0)$ to $(n,n)$?


The answer to this question depends on the probability function/measure that we specify for
this experiment (sometimes called a probability model). There are infinitely many choices for the
model and the choice of the right model depends on the context. Here we consider two examples:


Model I - As in the previous examples, assume a symmetric probability space, i.e. each lattice
path is equally likely. For this model, obtaining probabilities is a question of counting and
the result just follows the combinatorial expressions above:

$$P_I(A) = \frac{|A|}{|\Omega|} = \frac{1}{n+1}. \qquad (2.2)$$

Model II - We assume that at each grid intersection where the ant has an option of where to
go (‘east’ or ‘north’), it chooses either east or north, both with equal probability 1/2. In the
case where there is no option for the ant (i.e. it hits the east or north border) then it simply
continues along the border to the final destination $(n,n)$. For this model, it isn't as simple
to obtain an expression for $P(A)$. One way to do it is by considering a recurrence relation for
the probabilities (sometimes known as first step analysis). We omit the details and present
the result:

$$P_{II}(A) = \frac{\binom{2n-1}{n}}{2^{2n-1}}.$$

Hence we see that the probability of the event depends on the probability model used, and this
choice is not always a straightforward nor obvious one. For example, for $n = 5$ we have,

$$P_I(A) = \frac{1}{6} \approx 0.166, \qquad P_{II}(A) = \frac{126}{512} \approx 0.246.$$

We now estimate $P_I(A)$ and $P_{II}(A)$ by simulating both Model I and Model II in Listing 2.5.

Figure 2.3: Example of two different lattice paths.

Listing 2.5: Lattice paths

using Random, Combinatorics, Plots, LaTeXStrings ; pyplot()
Random.seed!(12)
n, N = 5, 10^5    # grid size and number of Monte Carlo runs (consistent with the output below)

function isUpperLattice(v)
    for i in 1:Int(length(v)/2)
        sum(v[1:2*i-1]) >= i ? continue : return false
    end
    return true
end

omega = unique(permutations([zeros(Int,n);ones(Int,n)]))
A = omega[isUpperLattice.(omega)]
pA_modelI = length(A)/length(omega)

# One random path under Model II (0 = east, 1 = north)
function randomWalkPath(n)
    x, y = 0, 0
    path = Int[]
    while x<n && y<n
        if rand()<0.5
            x += 1
            push!(path,0)
        else
            y += 1
            push!(path,1)
        end
    end
    append!(path, x<n ? zeros(Int64,n-x) : ones(Int64,n-y))
    return path
end

pA_modelIIest = sum([isUpperLattice(randomWalkPath(n)) for _ in 1:N])/N
println("Model I: ",pA_modelI, "\t Model II: ", pA_modelIIest)

function plotPath(v,l,c)
    x,y = 0,0
    graphX, graphY = [x], [y]
    for i in v
        if i == 0
            x += 1
        else
            y += 1
        end
        push!(graphX,x), push!(graphY,y)
    end
    plot!(graphX, graphY, la=0.8, lw=2, label=l, c=c, ratio=:equal,
        legend=:topleft, xlims=(0,n), ylims=(0,n),
        xlabel=L"East\rightarrow", ylabel=L"North\rightarrow")
end
plot()
plotPath(rand(A), "Upper lattice path", :blue)
plotPath(rand(setdiff(omega,A)), "Non-upper lattice path", :red)
plot!([0, n], [0, n], ls=:dash, c=:black, label="")





Model I: 0.16666666666666666     Model II: 0.24696

In the code, a path is encoded by a sequence of 0 and 1 values, indicating “move east” or “move
north” respectively. The function isUpperLattice() defined in lines 5-10 checks if a path is an
upper lattice path by summing all the odd partial sums, and returning false if any sum ends up at
a coordinate below the diagonal. Note the use of the ternary operator ? : in line 7. Also note that
in line 6, Int() is used to convert the division length(v)/2 to an integer type. In line 12, a
collection of all possible lattice paths is created by applying the permutations() function from the
Combinatorics package to an initial array of n zeros and n ones. The unique() function is then
used to remove all duplicates. In line 13 the isUpperLattice() function is applied to each element
of omega via the ‘.’ operator just after the function name. The result is a boolean array. Then
omega[] selects the elements of omega where the value is true and in the next line pA_modelI is
calculated. In lines 17-31 the function randomWalkPath() is implemented, which creates a random
path according to Model II. Note that the code in line 29 appends either zeros or ones to the path,
depending on if it hit the north boundary or east boundary first. Then in line 33, the Monte Carlo
estimate, pA_modelIIest, is determined. The function plotPath() defined in lines 36-50 plots a
path with a specified label and color. It is then invoked in line 52 for an upper lattice path selected via
rand(A) and again in the next line for a non-upper path by using setdiff(omega, A) to determine
the collection of non upper lattice paths. Functions dealing with sets are covered in more detail in
the next section.
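Since the derivation for Model II is omitted, it can be reassuring to check the stated closed form numerically. The sketch below is one way to do so, not the first step analysis itself: it propagates exact probabilities over the grid, keeping only mass that has stayed on or above the diagonal. The function name modelIIExact and the array prob are introduced here for illustration.

```julia
# Exact P(A) under Model II by forward propagation over grid points.
# prob[x+1, y+1] holds the probability of being at (x, y) without ever having
# gone below the diagonal (x = east steps taken, y = north steps taken).
function modelIIExact(n)
    prob = zeros(Rational{Int}, n + 1, n + 1)
    prob[1, 1] = 1
    for s in 0:2n-1, x in 0:min(s, n)
        y = s - x
        (y > n || x > y) && continue          # off the grid or below the diagonal
        p = prob[x + 1, y + 1]
        if y < n                              # free choice (x <= y < n)
            x + 1 <= y && (prob[x + 2, y + 1] += p // 2)   # east, if still on/above
            prob[x + 1, y + 2] += p // 2                    # north
        else                                  # on the north border: forced east
            prob[x + 2, y + 1] += p
        end
    end
    return prob[n + 1, n + 1]
end

for n in 1:6
    println(n, "\t", modelIIExact(n), "\t", binomial(2n - 1, n) // 2^(2n - 1),
            "\t Model I: ", 1 // (n + 1))
end
```

For n = 5 both columns give 63/256 (that is, 126/512), which is close to the Monte Carlo estimate printed by Listing 2.5.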





2.2 Working With Sets 


As evident from the examples in the previous section, mathematical sets play an integral part in
the evaluation of probability models. Subsets of the sample space $\Omega$ are also called events. By
carrying out intersections, unions and differences of sets, we may often express more complicated
events based on smaller ones.


A set is an unordered collection of unique elements. A set $A$ is a subset of the set $B$ if every
element that is in $A$ is also an element of $B$. The union of two sets, $A$ and $B$, denoted $A \cup B$, is
the set of all elements that are either in $A$ or $B$, or both. The intersection of the two sets, denoted
$A \cap B$, is the set of all elements that are in both $A$ and $B$. The difference, denoted $A \setminus B$, is the set
of all elements that are in $A$ but not in $B$.


In the context of probability, the sample space $\Omega$ is often considered as the universal set. This
allows us to then consider the complement of a set $A$, denoted $A^c$, which can be constructed via
all elements of $\Omega$ that are not in $A$. Note that $A^c = \Omega \setminus A$. Also observe that in the presence of a
universal set, $A \setminus B = A \cap B^c$.


Representing Sets in Julia 


Julia includes built-in capability for working with sets. Unlike an Array, a Set is an unordered
collection of unique objects. Listing 2.6 illustrates how to construct a Set in Julia, and
illustrates the use of the union(), intersect(), setdiff(), issubset() and in() functions.
There are also other functions related to sets that you may explore independently. These include
issetequal(), symdiff(), union!(), setdiff!(), symdiff!() and intersect!(). See
the online Julia documentation under “Collections and Data Structures”.

Listing 2.6: Basic set operations

A = Set([2,7,2,3])
B = Set(1:6)
omega = Set(1:10)

AunionB = union(A, B)
AintersectionB = intersect(A, B)
BdifferenceA = setdiff(B,A)
Bcomplement = setdiff(omega,B)
AsymDifferenceB = union(setdiff(A,B),setdiff(B,A))
println("A = $A, B = $B")
println("A union B = $AunionB")
println("A intersection B = $AintersectionB")
println("B diff A = $BdifferenceA")
println("B complement = $Bcomplement")
println("A symDifference B = $AsymDifferenceB")
println("The element '6' is an element of A: $(in(6,A))")
println("Symmetric difference and intersection are subsets of the union: ",
    issubset(AsymDifferenceB,AunionB),", ", issubset(AintersectionB,AunionB))








A = Set([7, 2, 3]), B = Set([4, 2, 3, 5, 6, 1])
A union B = Set([7, 4, 2, 3, 5, 6, 1])
A intersection B = Set([2, 3])
B diff A = Set([4, 5, 6, 1])
B complement = Set([7, 9, 10, 8])
A symDifference B = Set([7, 4, 5, 6, 1])
The element '6' is an element of A: false
Symmetric difference and intersection are subsets of the union: true, true





In lines 1-3 three different sets are created via the Set() function (a constructor). Note that A contains
only three elements, since sets are meant to be a collection of unique elements. Also note that, unlike
arrays, order is not preserved. Lines 5-9 perform various operations using the sets created. Lines 10-18
create the listing output. Note the use of the functions in() and issubset() in lines 16-18.
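Set identities such as the difference-as-intersection rule stated earlier can be checked on the same sets. The sketch below uses the Unicode operators ∪, ∩, ⊆ and ∈ that Julia Base provides as aliases; the helper complementOf is a name introduced here and is not a Base function. De Morgan's law is included because it is used later in this section.

```julia
A = Set([2, 7, 2, 3])
B = Set(1:6)
omega = Set(1:10)

# Complement relative to the universal set omega (hypothetical helper)
complementOf(S) = setdiff(omega, S)

# A \ B = A ∩ B^c
println(setdiff(A, B) == A ∩ complementOf(B))

# De Morgan: (A ∪ B)^c = A^c ∩ B^c
println(complementOf(A ∪ B) == complementOf(A) ∩ complementOf(B))

# Membership, subset, and symmetric difference via Base's symdiff()
println(3 ∈ A, ", ", (A ∩ B) ⊆ (A ∪ B), ", ",
        symdiff(A, B) == setdiff(A ∪ B, A ∩ B))
```

Comparing sets with == tests equality of contents, so the unordered internal layout of a Set does not affect these checks.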





The Probability of a Union 


Consider now two events (sets) $A$ and $B$. If $A \cap B = \emptyset$, then $P(A \cup B) = P(A) + P(B)$. However
more generally, when $A$ and $B$ are not disjoint, the probability of the intersection, $A \cap B$, plays a
role. For such cases the inclusion exclusion formula is useful:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B). \qquad (2.3)$$

To help illustrate this, consider the simple example of choosing a random lower case letter, ‘a’-‘z’.
Let $A$ be the event that the letter is a vowel (one of ‘a’, ‘e’, ‘i’, ‘o’, ‘u’). Let $B$ be the event that
the letter is one of the first three letters (one of ‘a’, ‘b’, ‘c’). Now since $A \cap B = \{$‘a’$\}$, a set with
one element, we have,

$$P(A \cup B) = \frac{5}{26} + \frac{3}{26} - \frac{1}{26} = \frac{7}{26}.$$

For another similar example, consider the case where $A$ is the set of vowels as before, but
$B = \{$‘x’, ‘y’, ‘z’$\}$. In this case, since the intersection of $A$ and $B$ is empty, we immediately know that
$P(A \cup B) = (5 + 3)/26 \approx 0.3077$. While this example is elementary, we now use it to illustrate a
type of conceptual error that one may make when using Monte Carlo simulation.
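Both letter examples can be evaluated exactly by counting set elements. The following sketch assumes a symmetric probability function on the 26 letters; the function P is defined here for illustration only.

```julia
omega = Set('a':'z')
A = Set("aeiou")                      # the vowels

# Probability of an event under the symmetric probability function
P(E) = length(E) // length(omega)

for B in (Set("abc"), Set("xyz"))
    lhs = P(union(A, B))
    rhs = P(A) + P(B) - P(intersect(A, B))   # inclusion exclusion, formula (2.3)
    println(lhs, " = ", rhs, " ≈ ", round(Float64(lhs), digits=4))
end
```

Using rationals (//) keeps the results as exact fractions, so the two sides can be compared with == rather than approximately.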


Consider Listing 2.7 and compare mcEst1 and mcEst2 from lines 12 and 13 respectively.
Both variables are designed to be estimators of $P(A \cup B)$. However, one of them is a correct estimator
and the other is faulty. In the following we look at the output given from both, and explore the
fault in the underlying logic.


Listing 2.7: An innocent mistake with Monte Carlo

using Random, StatsBase
Random.seed!(1)

A = Set(['a','e','i','o','u'])
B = Set(['x','y','z'])
omega = 'a':'z'

N = 10^6

println("mcEst1 \t \tmcEst2")
for _ in 1:5
    mcEst1 = sum([in(sample(omega),A) || in(sample(omega),B) for _ in 1:N])/N
    mcEst2 = sum([in(sample(omega),union(A,B)) for _ in 1:N])/N
    println(mcEst1,"\t",mcEst2)
end





First observe line 12. In Julia, || means “or”, so at first glance the estimator mcEst1 looks
sensible, since:

$$A \cup B = \text{the set of all elements that are in } A \text{ or } B.$$

Hence we are generating a random element via sample(omega) and checking if it is an element
of A or an element of B. However there is a subtle error. Each of the N random experiments
involves two separate calls to sample(omega). Hence the code in line 12 simulates a situation
where conceptually, the sample space, $\Omega$, is composed of pairs of letters (2-tuples), not single letters!

Hence the code computes probabilities of the event, $A_1 \cup B_2$ where,

$$A_1 = \text{First element of the tuple is a vowel},$$

$$B_2 = \text{Second element of the tuple is an ‘x’, ‘y’, or ‘z’ letter}.$$

Now observe that $A_1$ and $B_2$ are not disjoint events, hence,

$$P(A_1 \cup B_2) = P(A_1) + P(B_2) - P(A_1 \cap B_2).$$

Further it holds that $P(A_1 \cap B_2) = P(A_1)P(B_2)$. This follows from independence (further explored
in Section 2.3). Now that we have identified the error, we can predict the resulting output.

$$P(A_1 \cup B_2) = P(A_1) + P(B_2) - P(A_1)P(B_2) = \frac{5}{26} + \frac{3}{26} - \frac{5}{26} \cdot \frac{3}{26} \approx 0.2855.$$

It can be seen from the code output, which repeats the comparison 5 times, that mcEst1 consistently
underestimates the desired probability, yielding estimates near 0.2855 instead.

mcEst1          mcEst2

0.285158 0.307668 
0.285686 0.307815 
0.285022 0.308132 
0.285357 0.307261 
0.285175 0.306606 





In lines 11-15 a for loop is implemented, which generates 5 Monte Carlo predictions. Note that lines
12 and 13 contain the main logic of this example. Line 12 is our incorrect simulation, and yields
incorrect estimates. See the text above for a detailed explanation as to why the use of two separate
calls to sample() is incorrect in this case. Line 13 is our correct simulation, and for large N yields
results close to the expected result. Note that the union() function is used on A and B, instead of
the “or” operator, ||, used in line 12. The important point is that only a single sample is generated
for each iteration of the comprehension.
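The “or” operator itself is not the problem; what matters is that both membership tests look at the same draw. The sketch below is a possible variant, not part of Listing 2.7: it stores a single sample and reuses it on both sides of ||. It uses rand() from Julia Base instead of sample(), and the names inAorB and mcEst3 are introduced here.

```julia
using Random
Random.seed!(1)

A = Set(['a','e','i','o','u'])
B = Set(['x','y','z'])
omega = 'a':'z'
N = 10^6

# One draw per experiment, reused for both membership checks
function inAorB()
    letter = rand(omega)
    return letter in A || letter in B
end
mcEst3 = sum(inAorB() for _ in 1:N) / N

pA, pB = 5 // 26, 3 // 26
println("Single-draw estimate with ||: ", mcEst3)
println("Exact P(A ∪ B):      ", Float64(pA + pB))             # A and B disjoint
println("Exact P(A_1 ∪ B_2):  ", Float64(pA + pB - pA * pB))   # two independent draws
```

The estimate should sit near 0.3077 rather than 0.2855, matching the behaviour of mcEst2.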





Secretary with Envelopes 


Now consider a more general form of the inclusion exclusion principle applied to a collection of
sets, $C_1, \ldots, C_n$. It is presented below, written in two slightly different forms:

$$P\Big(\bigcup_{i=1}^{n} C_i\Big) = \sum_{i=1}^{n} P(C_i) - \sum_{\text{pairs}} P(C_i \cap C_j) + \sum_{\text{triplets}} P(C_i \cap C_j \cap C_k) - \ldots + (-1)^{n-1} P(C_1 \cap \ldots \cap C_n)$$

$$= \sum_{i=1}^{n} P(C_i) - \sum_{i<j} P(C_i \cap C_j) + \sum_{i<j<k} P(C_i \cap C_j \cap C_k) - \ldots + (-1)^{n-1} P\Big(\bigcap_{i=1}^{n} C_i\Big).$$

Notice that there are $n$ major terms. The first term deals with probabilities of individual events;
the second term deals with pairs; the third with triplets; and the sequence continues until a single
final term involving a single intersection is reached. The $\ell$'th term has $\binom{n}{\ell}$ summands. For example,
there are $\binom{n}{2}$ pairs, $\binom{n}{3}$ triplets, etc. Notice also the alternating signs via $(-1)^{\ell-1}$. It is possible to
conceptually see the validity of this formula for the case of $n = 3$ by drawing a Venn diagram and
seeing the role of all summands. In this case,

$$P(C_1 \cup C_2 \cup C_3) = P(C_1) + P(C_2) + P(C_3) - P(C_1 \cap C_2) - P(C_1 \cap C_3) - P(C_2 \cap C_3) + P(C_1 \cap C_2 \cap C_3).$$
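As an alternative to a Venn diagram, the formula can be checked on small finite sets by counting elements, which is equivalent to using a symmetric probability function. The sketch below enumerates every non-empty sub-collection with a bitmask; the function name unionSizeByInclusionExclusion and the example sets C1, C2, C3 are chosen here for illustration.

```julia
# Size of the union of a collection of sets via inclusion exclusion:
# add intersections of odd-sized sub-collections, subtract even-sized ones.
function unionSizeByInclusionExclusion(sets)
    n = length(sets)
    total = 0
    for mask in 1:(2^n - 1)                      # every non-empty sub-collection
        chosen = [sets[i] for i in 1:n if (mask >> (i - 1)) & 1 == 1]
        sign = isodd(length(chosen)) ? 1 : -1
        total += sign * length(intersect(chosen...))
    end
    return total
end

C1, C2, C3 = Set(1:5), Set(4:8), Set([1, 5, 8, 9])
println(unionSizeByInclusionExclusion([C1, C2, C3]))
println(length(union(C1, C2, C3)))
```

Dividing both counts by the size of a common sample space turns the identity into the probability statement above.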


Let us now consider a classic example that uses this inclusion exclusion principle. Assume that a 
secretary has an equal number of pre-labelled envelopes and business cards, n. Suppose that at the 
end of the day, he is in such a rush to go home that he puts each business card in an envelope at 
random without any thought of matching the business card to its intended recipient on the envelope. 
The probability that each of the business cards will go to the correct envelope is easy to obtain. It 
is 1/n!, which goes to zero very quickly as n grows. However, what is the probability that each of 
the business cards will go to a wrong envelope? 


As an aid, let $A_i$ be the event that the $i$'th business card is put in the correct envelope. We
have a handle on events involving intersections of distinct $A_i$ values. For example, if $n = 10$, then
$P(A_1 \cap A_4 \cap A_9) = 7!/10!$, or more generally, the probability of an intersection of $k$ such events is
$p_k := (n-k)!/n!$.

The event we are seeking to evaluate is $B = A_1^c \cap A_2^c \cap \ldots \cap A_n^c$. Hence by De Morgan’s laws,
$B^c = A_1 \cup \ldots \cup A_n$. Hence using the inclusion exclusion formula together with $p_k$, we can simplify
factorials and binomial coefficients to obtain:

$$P(B) = 1 - P(A_1 \cup \ldots \cup A_n) = 1 - \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} p_k = 1 - \sum_{k=1}^{n} \frac{(-1)^{k+1}}{k!} = \sum_{k=0}^{n} \frac{(-1)^k}{k!}. \qquad (2.4)$$

Observe that as $n \to \infty$ this probability converges to $1/e \approx 0.3679$, yielding a simple asymptotic
approximation. Listing 2.8 evaluates $P(B)$ in several alternative ways for $n = 1, 2, \ldots, 6$. The
function bruteSetsProbabilityAllMiss() works by creating all possibilities and counting.
Although a highly inefficient way of evaluating $P(B)$, it is presented here as it is instructive. The
function formulaCalcAllMiss() evaluates the analytic solution from (2.4). Finally, the function
mcAllMiss() estimates the probability via Monte Carlo simulation.





Listing 2.8: Secretary with envelopes

using Random, StatsBase, Combinatorics
Random.seed!(1)

function bruteSetsProbabilityAllMiss(n)
    omega = collect(permutations(1:n))
    matchEvents = []
    for i in 1:n
        event = []
        for p in omega
            if p[i] == i
                push!(event,p)
            end
        end
        push!(matchEvents,event)
    end
    noMatch = setdiff(omega,union(matchEvents...))
    return length(noMatch)/length(omega)
end

formulaCalcAllMiss(n) = sum([(-1)^k/factorial(k) for k in 0:n])

function mcAllMiss(n,N)
    function envelopeStuffer()
        envelopes = Random.shuffle!(collect(1:n))
        return sum([envelopes[i] == i for i in 1:n]) == 0
    end
    data = [envelopeStuffer() for _ in 1:N]
    return sum(data)/N
end

N = 10^6

println("n\tBrute Force\tFormula\t\tMonte Carlo\tAsymptotic",)
for n in 1:6
    bruteForce = bruteSetsProbabilityAllMiss(n)
    fromFormula = formulaCalcAllMiss(n)
    fromMC = mcAllMiss(n,N)
    println(n,"\t",round(bruteForce,digits=4),"\t\t",round(fromFormula,digits=4),
        "\t\t",round(fromMC,digits=4),"\t\t",round(1/MathConstants.e,digits=4))
end

n    Brute Force    Formula    Monte Carlo    Asymptotic
1    0.0            0.0        0.0            0.3679
2    0.5            0.5        0.4994         0.3679
3    0.3333         0.3333     0.3337         0.3679
4    0.375          0.375      0.3747         0.3679
5    0.3667         0.3667     0.3665         0.3679
6    0.3681         0.3681     0.3678         0.3679





Lines 4-18 define the function bruteSetsProbabilityAllMiss(), which uses a brute force
approach to calculate P(B). The nested loops in lines 7-15 populate the array matchEvents with
elements of omega that have a match. The inner loop in lines 9-13 puts elements from omega in
event if they satisfy an i'th match. In line 16, notice the use of the 3 dots splat operator, ....
Here union() is applied to all the elements of matchEvents. The return value in line 17 is a
direct implementation via counting the elements of noMatch. The function on line 20 implements
(2.4) in a straightforward manner. Lines 22-29 implement the function mcAllMiss() that estimates
the probability via Monte Carlo. The inner function, envelopeStuffer(), returns a result from a
single experiment. Note that shuffle!() is used to create a random permutation in line 24. The
remainder of the code prints the output, and compares the results to the asymptotic formula obtained
via 1/MathConstants.e.
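A cheaper exact check than enumerating all permutations is to count the permutations with no match directly. The sketch below uses the classical derangement recursion D(n) = (n-1)(D(n-1) + D(n-2)), which is a standard combinatorial fact rather than something derived in the text; the names derangements and formulaExact are introduced here, and BigInt rationals are used so that larger n do not overflow.

```julia
# Number of permutations of 1:n with no fixed point, with D(0) = 1 and D(1) = 0.
function derangements(n)
    dPrev, dCur = big(1), big(0)          # D(0), D(1)
    n == 0 && return dPrev
    for m in 2:n
        dPrev, dCur = dCur, (m - 1) * (dCur + dPrev)
    end
    return dCur
end

# The sum in (2.4), evaluated with exact rational arithmetic
formulaExact(n) = sum((-1)^k // factorial(big(k)) for k in 0:n)

for n in 1:8
    exact = derangements(n) // factorial(big(n))
    println(n, "\t", exact, "\t", round(Float64(exact), digits=4),
            "\t", exact == formulaExact(n))
end
```

The last column confirms that the counting recursion and formula (2.4) agree exactly for each n shown.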














An Occupancy Problem 


We now consider a problem related to the previous example. Imagine now the secretary placing 
r identical business cards randomly into n envelopes, with r > n and no limit on the number of 
business cards that can fit in an envelope. We now ask what is the probability that all envelopes 
are non-empty (i.e. occupied)? 


To begin, denote $A_i$ as the event that the $i$'th envelope is empty, and hence $A_i^c$ is the event
that the $i$'th envelope is occupied. Hence as before, we are seeking the probability of the event
$B = A_1^c \cap A_2^c \cap \ldots \cap A_n^c$. Using the same logic as in the previous example,

$$
P(B) = 1 - P(A_1 \cup \ldots \cup A_n) = 1 - \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} \tilde{p}_k,
$$

where $\tilde{p}_k$ is the probability that a given set of $k$ envelopes is empty. Now from basic counting
considerations,

$$
\tilde{p}_k = \frac{(n-k)^r}{n^r} = \Big(1 - \frac{k}{n}\Big)^r.
$$

Thus we arrive at,

$$
P(B) = 1 - \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} \Big(1 - \frac{k}{n}\Big)^r
     = \sum_{k=0}^{n} (-1)^{k} \binom{n}{k} \Big(1 - \frac{k}{n}\Big)^r. \qquad (2.5)
$$

We now calculate P(B) in Listing 2.9 and compare the results to Monte Carlo simulation estimates.
In the code we consider several situations by varying the number of envelopes in the range n =
1,...,100, and for every n, consider the number of business cards r = Kn for K = 2,3,4. The
results are displayed in Figure 2.4.

Figure 2.4: Analytic and estimated probabilities that no envelopes are empty,
for various cases of n envelopes, and Kn business cards.


Listing 2.9: An occupancy problem

using Plots ; pyplot()

occupancyAnalytic(n,r) = sum([(-1)^k*binomial(n,k)*(1 - k/n)^r for k in 0:n])

function occupancyMC(n,r,N)
    fullCount = 0
    for _ in 1:N
        envelopes = zeros(Int,n)
        for k in 1:r
            target = rand(1:n)
            envelopes[target] += 1
        end
        numFilled = sum(envelopes .> 0)
        if numFilled == n
            fullCount += 1
        end
    end
    return fullCount/N
end

max_n, N, Kvals = 100, 10^3, [2,3,4]

analytic = [[occupancyAnalytic(big(n),big(k*n)) for n in 1:max_n] for k in Kvals]
monteCarlo = [[occupancyMC(n,k*n,N) for n in 1:max_n] for k in Kvals]

plot(1:max_n, analytic, c=[:blue :red :green],
    label=["K=2" "K=3" "K=4"])
scatter!(1:max_n, monteCarlo, mc=:black, shape=:+,
    label="", xlims=(0,max_n),ylims=(0,1),
    xlabel="n Envelopes", ylabel="Probability", legend=:topright)

In line 3 we create the function occupancyAnalytic(), which evaluates (2.5). Note the use of the
binomial() function. Lines 5-19 define the function occupancyMC(), which approximates P(B)
for specific inputs via Monte Carlo simulation. Note the additional argument N, which is the total
number of simulation runs. Line 6 defines the variable fullCount, which represents the total number
of times all envelopes are full. Lines 7-17 contain the core logic of this function, and represent the
act of the secretary assigning all business cards randomly to the envelopes, and repeating this process
N times in total. Observe that in this for loop, there is no need to keep a count of the loop iteration
number, hence for clarity we use an underscore in line 7. Line 13 checks each element of envelopes to
see if it is empty (i.e. 0), and evaluates the total number of envelopes which are not empty. Note
the use of element-wise comparison .>, resulting in an array of boolean values that can be summed.
Lines 14-16 check if all envelopes have been filled, and if so increment fullCount by 1. In lines 23
and 24 we create analytic and monteCarlo respectively. Each of these is an array of arrays, with
an internal array for K=2, K=3 and K=4. The results are then plotted.
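
As a sanity check on formula (2.5), the following sketch compares it against exhaustive enumeration for a small case. It is not part of the listing: the function names occupancyExact and occupancyBrute are hypothetical, and rational arithmetic is used here (instead of the floating point of the listing) so that the two results can be compared exactly.

```julia
# Formula (2.5) evaluated with exact rational arithmetic
occupancyExact(n, r) = sum((-1)^k * binomial(n, k) * (1 - k//n)^r for k in 0:n)

# Brute force: enumerate all n^r ways of placing r cards into n envelopes
# and count the placements that leave no envelope empty
function occupancyBrute(n, r)
    full = 0
    for assignment in Iterators.product(fill(1:n, r)...)
        full += length(Set(assignment)) == n
    end
    return full // n^r
end

println(occupancyExact(3, 4), " vs ", occupancyBrute(3, 4))
```

The enumeration grows as n^r, so it is only practical for very small n and r; its purpose is to confirm the formula, not to replace it.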





2.3 Independence 


We now consider independence and independent events. Two events, A and B, are said to be 
independent if the probability of their intersection is the product of their probabilities: 


P(A ∩ B) = P(A)P(B).


A classic example is a situation where a random experiment involves physical components that 
are assumed to not interact, for example flipping two coins. Independence is often a modeling 
assumption and plays a key role in many models presented in the remainder of the book. 


Note that “independent events” should not be confused with “disjoint events”: these
concepts are completely different. Take disjoint events A and B, with P(A) > 0 and P(B) > 0.
This means that P(A)P(B) > 0. It is easy to see that the events are not independent. Since they
are disjoint, A ∩ B = ∅ and P(∅) = 0, hence,


0 = P(∅) = P(A ∩ B) ≠ P(A)P(B).


To explore independence, it is easiest to consider a situation where it does not hold. Consider
drawing a number uniformly from the range 10,11,...,25. What is the probability of getting the
number 13? Clearly there are 25 - 10 + 1 = 16 options, and hence the probability is 1/16 = 0.0625.
However, the event of obtaining 13 could be described as the intersection of the events A :=
{first digit is 1} and B := {second digit is 3}, whose probabilities are 10/16 = 0.625 and
2/16 = 0.125 respectively. Notice that the product of these probabilities is not 0.0625, but rather
20/256 = 0.078125. Hence we see that P(A ∩ B) ≠ P(A)P(B) and the events are not independent.


One way of viewing this lack of independence is as follows. Witnessing the event A gives us
some information about the likelihood of B. If A occurs, we know that the number is in the
range 10,...,19 and hence there is a 1/10 chance for B to occur. However, if A does not occur then
we lie in the range 20,...,25 and there is a 1/6 chance for B to occur.


If however we change the range of random digits to be 10,...,29 then the two events are
independent. This can be demonstrated by running Listing 2.10 and then modifying line 4.

Listing 2.10: Independent events

using Random
Random.seed!(1)

numbers = 10:25
N = 10^7

firstDigit(x) = Int(floor(x/10))
secondDigit(x) = x%10

numThirteen, numFirstIsOne, numSecondIsThree = 0, 0, 0

for _ in 1:N
    X = rand(numbers)
    global numThirteen += X == 13
    global numFirstIsOne += firstDigit(X) == 1
    global numSecondIsThree += secondDigit(X) == 3
end

probThirteen, probFirstIsOne, probSecondIsThree =
    (numThirteen,numFirstIsOne,numSecondIsThree)./N

println("P(13) = ", round(probThirteen, digits=4),
    "\nP(1_) = ",round(probFirstIsOne, digits=4),
    "\nP(_3) = ", round(probSecondIsThree, digits=4),
    "\nP(1_)*P(_3) = ",round(probFirstIsOne*probSecondIsThree, digits=4))





P(13) = 0.0626
P(1_) = 0.6249
P(_3) = 0.1252
P(1_)*P(_3) = 0.0783





Lines 4 and 5 set the range of numbers considered and the number of simulation runs respectively.
Line 7 defines a function that returns the first digit of our number through the use of the floor()
function, and converts the resulting value to an integer type. Line 8 defines a function that uses the
modulus operator % to return the second digit of our number. In line 10 we initialize three counters,
which record how often the number 13 is drawn, how often the first digit is 1, and how often the
second digit is 3, respectively. Lines 12-17 contain the core logic of this example, where N random
numbers are generated. For each random number X that is generated, lines 14, 15 and 16 increment
the corresponding count by 1 if the specified condition is met. Lines 19-20 evaluate the total
proportions.
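
The simulation can be complemented by an exact check, since the sample space is small enough to count directly. The sketch below is an illustration rather than part of the listing; the helper exactProbs is a hypothetical name, and rational numbers are used so that the equality P(A ∩ B) = P(A)P(B) can be tested without rounding.

```julia
firstDigit(x) = div(x, 10)
secondDigit(x) = x % 10

# Exact probabilities of A = {first digit is 1}, B = {second digit is 3}
# and A ∩ B = {13} under a uniform draw from `numbers`
function exactProbs(numbers)
    n = length(numbers)
    pA = count(x -> firstDigit(x) == 1, numbers) // n
    pB = count(x -> secondDigit(x) == 3, numbers) // n
    pAB = count(x -> x == 13, numbers) // n
    return pA, pB, pAB
end

for numbers in (10:25, 10:29)
    pA, pB, pAB = exactProbs(numbers)
    println(numbers, ": P(A)P(B) = ", pA*pB, ", P(A ∩ B) = ", pAB,
            ", independent: ", pA*pB == pAB)
end
```

For the range 10:25 the product and the intersection probability differ, while for 10:29 they coincide, in agreement with the discussion above.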





2.4 Conditional Probability 


It is often the case that knowing an event has occurred, say B, modifies our belief about the
chances of another event occurring, say A. This concept is captured via the conditional probability of
A given B, denoted by P(A | B) and defined for B where P(B) > 0. In practice, given a probability
model, P(·), we construct the conditional probability, P(· | B) via,

$$
P(A \mid B) := \frac{P(A \cap B)}{P(B)}. \qquad (2.6)
$$

This immediately shows that if events A and B are independent then P(A | B) = P(A).


As an elementary example, refer back to Table depicting the outcome of rolling two dice. 
Set B as the event of the sum being greater than or equal to 10. In other words, 


B = {(i,j) | i + j ≥ 10}.


To help illustrate this further, consider a game player who rolls the dice without showing us the 
result, and then poses to us the following: “The sum is greater or equal to 10. Is it even or odd?”. 
Let A be the event of the sum being even. We then evaluate, 






































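$$
P(A \mid B) = P(\text{Sum is even} \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\text{Sum is 10 or 12})}{P(\text{Sum} \ge 10)} = \frac{4/36}{6/36} = \frac{2}{3},
$$

$$
P(A^c \mid B) = P(\text{Sum is odd} \mid B) = \frac{P(A^c \cap B)}{P(B)} = \frac{P(\text{Sum is 11})}{P(\text{Sum} \ge 10)} = \frac{2/36}{6/36} = \frac{1}{3}.
$$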











It can be seen that given B, it is more likely that A occurs (even) as opposed to $A^c$ (odd), hence
we are better off answering “even”.
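
The same conditional probabilities can be obtained by direct enumeration of the 36 outcomes. The following is a minimal sketch; the variable names (outcomes, evenInB, oddInB and so on) are chosen for this illustration only.

```julia
# All 36 equally likely outcomes of rolling two dice
outcomes = [(i, j) for i in 1:6 for j in 1:6]

# Event B: the sum is at least 10
B = filter(o -> sum(o) >= 10, outcomes)

# Outcomes of B with an even sum (A ∩ B) and with an odd sum (Aᶜ ∩ B)
evenInB = filter(o -> iseven(sum(o)), B)
oddInB = filter(o -> isodd(sum(o)), B)

# Conditional probabilities as exact rationals; since outcomes are equally
# likely, P(A | B) reduces to a ratio of counts
probEvenGivenB = length(evenInB) // length(B)
probOddGivenB = length(oddInB) // length(B)

println("P(even | B) = ", probEvenGivenB, ", P(odd | B) = ", probOddGivenB)
```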


The Law of Total Probability 


Often our probability model is comprised of conditional probabilities as elementary building
blocks. In such cases, (2.6) is better viewed as,

$$
P(A \cap B) = P(B)\,P(A \mid B).
$$

This is particularly useful when there exists some partition of $\Omega$, namely, $\{B_0, B_1, \ldots\}$. A partition
of a set U is a collection of non-empty sets that are mutually disjoint and whose union is U. Such
a partition allows us to represent A as a disjoint union of the sets $A \cap B_k$, and treat $P(A \mid B_k)$ as
model data. In such a case, we have the law of total probability,

$$
P(A) = \sum_{k=0}^{\infty} P(A \cap B_k) = \sum_{k=0}^{\infty} P(A \mid B_k)\,P(B_k).
$$


As an exotic fictional example, consider the world of semi-conductor manufacturing. Room
cleanliness in the manufacturing process is critical, and dust particles are kept to a minimum. Let A be
the event of a manufacturing failure, and assume that it depends on the number of dust particles
via,

$$
P(A \mid B_k) = 1 - \frac{1}{k+1},
$$

where $B_k$ is the event of having k dust particles in the room (k = 0,1,2,...). Clearly the larger k,
the higher the chance of manufacturing failure. Furthermore assume that,

$$
P(B_k) = \frac{6}{\pi^2 (k+1)^2} \quad \text{for } k = 0, 1, \ldots.
$$

From the well known Basel Problem, we have $\sum_{k=1}^{\infty} k^{-2} = \pi^2/6$. This implies that $\sum_{k=0}^{\infty} P(B_k) = 1$.

Now we ask, what is the probability of manufacturing failure? The analytic solution is given by,

$$
P(A) = \sum_{k=0}^{\infty} P(A \mid B_k)\,P(B_k) = \sum_{k=0}^{\infty} \Big(1 - \frac{1}{k+1}\Big) \frac{6}{\pi^2 (k+1)^2}.
$$

With some calculus, the infinite series can be explicitly evaluated to,

$$
P(A) = 1 - \frac{6\,\zeta(3)}{\pi^2} \approx 0.2692,
$$

where $\zeta(\cdot)$ is the Riemann Zeta Function,

$$
\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s},
$$

and $\zeta(3) \approx 1.2021$. Note that the appearance of $\zeta(\cdot)$ in this example is by design, due to the fact that
we chose $P(A \mid B_k)$ and $P(B_k)$ to have this specific structure. Listing 2.11 approximates the infinite
series numerically (truncating at n = 2000) and compares the result to the analytic solution.


Listing 2.11: Defects in manufacturing 


using SpecialFunctions

n = 2000

probAgivenB(k) = 1 - 1/(k+1)
probB(k) = 6/(pi*(k+1))^2

numerical = sum([probAgivenB(k)*probB(k) for k in 0:n])
analytic = 1 - 6*zeta(3)/pi^2

println("Analytic: ", analytic, "\tNumerical: ", numerical)





Analytic: 0.26923703059856086 Numerical: 0.26893337073278945 





This listing is self-explanatory, however note the use of the Julia function zeta() from the 
SpecialFunctions package in line 9. Note also that pi is a defined constant. 
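
For readers who want the intermediate step behind the closed form for P(A), the following sketch splits the series into two known sums. It uses only the definitions of $P(A \mid B_k)$ and $P(B_k)$ given above, the substitution $m = k + 1$ (a symbol introduced here for convenience), and the Basel identity $\zeta(2) = \pi^2/6$.

```latex
\begin{align*}
P(A) &= \sum_{k=0}^{\infty}\Big(1-\frac{1}{k+1}\Big)\frac{6}{\pi^2 (k+1)^2}
      = \frac{6}{\pi^2}\sum_{m=1}^{\infty}\Big(\frac{1}{m^2}-\frac{1}{m^3}\Big) \\
     &= \frac{6}{\pi^2}\big(\zeta(2)-\zeta(3)\big)
      = 1-\frac{6\,\zeta(3)}{\pi^2}.
\end{align*}
```

The terms of the series decay like $6/(\pi^2 k^2)$, so truncating at n leaves a tail of roughly $6/(\pi^2 n)$. For n = 2000 this is about 0.0003, which is in line with the gap between the analytic and numerical values printed above.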


2.5 Bayes’ Rule 


Bayes’ rule, also known as Bayes’ theorem, is nothing but a simple manipulation of (2.6) yielding,

$$
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}. \qquad (2.7)
$$

However, the consequences are far reaching. Often we observe a posterior outcome or measurement,
say the event B, and wish to evaluate the probability of a prior condition, say the event A. That
is, given some measurement or knowledge we wish to evaluate how likely it is that a prior condition
occurred. Equation (2.7) allows us to do just that.




Was it a 0 or a 1?


As an example, consider a communication channel involving a stream of transmitted bits (0’s
and 1’s), where 70% of the bits are 1, and the rest 0. A typical snippet from the channel is
...0101101011101111101....


The channel is imperfect due to physical disturbances such as interfering radio signals, and
furthermore the bits received are sometimes distorted. Hence there is a chance ($\varepsilon_0$) of interpreting
a bit as 1 when it is actually 0, and similarly, there is a chance ($\varepsilon_1$) of interpreting a bit as 0 when
it is actually 1.


Now say that we received (Rx) a bit, and interpreted it as 1. This is the posterior outcome.
What is the chance that it was in fact transmitted (Tx) as a 1? Applying Bayes’ rule:

$$
P(\text{Tx 1} \mid \text{Rx 1}) = \frac{P(\text{Rx 1} \mid \text{Tx 1})\,P(\text{Tx 1})}{P(\text{Rx 1})} = \frac{(1-\varepsilon_1)\,0.7}{0.7\,(1-\varepsilon_1) + 0.3\,\varepsilon_0}. \qquad (2.8)
$$

For example, if $\varepsilon_0 = 0.1$ and $\varepsilon_1 = 0.05$ we have that P(Tx 1 | Rx 1) = 0.9568. Listing 2.12 illustrates
this via simulation.


Listing 2.12: Tx Rx Bayes

using Random
Random.seed!(1)

N = 10^5
prob1 = 0.7
eps0, eps1 = 0.1, 0.05

flipWithProb(bit,prob) = rand() < prob ? xor(bit,1) : bit

TxData = rand(N) .< prob1
RxData = [x == 0 ? flipWithProb(x,eps0) : flipWithProb(x,eps1) for x in TxData]

numTx1 = 0
totalRx1 = 0
for i in 1:N
    if RxData[i] == 1
        global totalRx1 += 1
        global numTx1 += TxData[i]
    end
end

monteCarlo = numTx1/totalRx1
analytic = ((1-eps1)*0.7)/((1-eps1)*0.7+0.3*eps0)

println("Monte Carlo: ", monteCarlo, "\t\tAnalytic: ", analytic)





Monte Carlo: 0.9576048007598325 Analytic: 0.9568345323741007 


























Figure 2.5: Monty Hall: If the prize is behind Door 2 and Door 1 is chosen, 
the game show host must reveal Door 3. 





In line 8 the function flipWithProb() is defined. It uses the xor() function to randomly flip
the input argument bit, according to the rate given by the argument prob. Line 10 generates the
array TxData, which contains true and false values representing our transmitted bits of 1’s and 0’s
respectively. It does this by generating uniform random numbers on the range [0, 1], and then
evaluating element-wise if they are less than the specified probability of transmitting a 1, prob1. Line 11
generates the array RxData, which represents our simulated received data. First the value of each
transmitted bit is checked, and the flipWithProb() function is then used to flip it at the rate specified
in line 6 for a 0 or a 1 respectively. Lines 13-20 are used to check the nature of all bits. If the bit
received is 1, then the counter totalRx1 is incremented by 1, and the counter numTx1 is incremented
by the value of the transmitted bit (which may be 1, but could also be 0). The remaining lines then
calculate the Monte Carlo based estimate and compare it to the analytic solution from (2.8).
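
The analytic expression can also be written as a small function in which the denominator P(Rx 1) is computed explicitly through the law of total probability. This is a sketch for illustration; the function name posteriorTx1 and its argument order are assumptions, not part of the listing.

```julia
# Posterior probability that a 1 was transmitted given that a 1 was received.
# prob1: probability of transmitting a 1
# eps0:  chance of reading a transmitted 0 as 1
# eps1:  chance of reading a transmitted 1 as 0
function posteriorTx1(prob1, eps0, eps1)
    # P(Rx 1) = P(Rx 1 | Tx 1)P(Tx 1) + P(Rx 1 | Tx 0)P(Tx 0)
    probRx1 = (1 - eps1)*prob1 + eps0*(1 - prob1)
    return (1 - eps1)*prob1 / probRx1
end

println(posteriorTx1(0.7, 0.1, 0.05))
```

Writing the denominator this way makes it easy to experiment with other channel parameters, for instance to see how the posterior degrades as eps0 grows.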





The Monty Hall Problem 


The Monty Hall problem is a famous problem which was first posed and solved in 1975 by the
mathematician Steve Selvin [SBK+75]. It is a well known example illustrating how probabilistic reasoning
may sometimes yield surprising results.


Consider a contestant on a television game show, with three doors in front of her. One of the
doors contains a prize, while the other two are empty. The contestant is then asked to guess which
door contains the prize, and she makes a random guess. Following this, the game show host (GSH)
reveals an empty (losing) door from one of the two remaining doors not chosen. The contestant
is then asked if she wishes to stay with her original choice, or if she wishes to switch to the
remaining closed door. Following the choice of the contestant to stay or switch, the door with the
prize is revealed. The question is: should the contestant stay with her original choice, or switch?
Alternatively, perhaps it doesn’t matter.


For example, in Figure 2.5 we see the situation where the hidden prize is behind door 2. Say the
contestant has chosen door 1. In this case, the GSH has no choice but to reveal door 3. Alternatively,
if the contestant has chosen door 2, then the GSH will reveal either door 1 or door 3.

The two possible policies (or strategies of play) for the contestant are:


Policy I - Stay with their original choice after the door is revealed.


Policy II - Switch after the door is revealed.


Let us consider the probability of winning for the two different policies. If the player adopts 
Policy I then she always stays with her initial guess regardless of the GSH action. In this case, her 
chance of success is 1/3; that is, she wins if her initial choice is correct. 


However if she adopts Policy II then she always switches after the GSH reveals an empty door. 
In this case we can show that her chance of success is 2/3; that is, she actually wins if her initial 
guess is incorrect. This is because the GSH must always reveal a losing door. If she originally chose 
a losing door, then the GSH must reveal the second losing door every time (otherwise he would 
reveal the prize). That is, if the player chooses an incorrect door at the start, the non-revealed door 
will always be the winning door. The chance of such an event is 2/3. 


As a further aid for understanding imagine a case of 100 doors and a single prize behind one of 
them. In this case assume that the player chooses a door, for example door 1, and following this 
the GSH reveals 98 losing doors. There are now only two doors remaining, her choice door 1, and 
(say for example), door 38. The intuition of the problem suddenly becomes obvious. The player’s 
original guess was random and hence door 1 had a 1/100 chance of containing the prize, however 
the GSH’s actions were constrained. He had to reveal only losing doors, and hence there is a 99/100 
chance that door 38 contains the prize. Hence, Policy II is clearly superior. 


We now analyze the case of 3 doors by applying Bayes’ theorem. Let $A_i$ be the event that the
prize is behind door i. Let $B_i$ be the event that door i is revealed by the GSH. Then, for example,
if the player initially chooses door 1 and then the GSH reveals door 2, we have the following:

$$
P(A_1 \mid B_2) = \frac{P(B_2 \mid A_1)\,P(A_1)}{P(B_2)} = \frac{\frac{1}{2} \times \frac{1}{3}}{\frac{1}{2}} = \frac{1}{3}, \qquad \text{(Policy I)}
$$

$$
P(A_3 \mid B_2) = \frac{P(B_2 \mid A_3)\,P(A_3)}{P(B_2)} = \frac{1 \times \frac{1}{3}}{\frac{1}{2}} = \frac{2}{3}. \qquad \text{(Policy II)}
$$
2 


In the second case note that P(B2 | A3) = 1 because the GSH must reveal door 2 if the prize is 
behind door 3 since door 1 was already picked. Hence, we see that while neither policy guarantees 
a win, Policy II clearly dominates Policy I. 
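
The Bayes computation above, including the denominator $P(B_2) = 1/2$, can be reproduced with a few lines of exact arithmetic. This is a sketch under the same assumption as in the text, namely that the player has chosen door 1; the variable names prior, likelihoodB2, probB2 and posterior are hypothetical.

```julia
# Prior probabilities P(A_i) that the prize is behind door i = 1, 2, 3
prior = [1//3, 1//3, 1//3]

# P(B_2 | A_i): chance the GSH reveals door 2, given the player chose door 1.
# Prize behind 1: he picks door 2 or 3 at random. Prize behind 2: he never
# reveals it. Prize behind 3: he is forced to reveal door 2.
likelihoodB2 = [1//2, 0//1, 1//1]

# Law of total probability for the denominator P(B_2)
probB2 = sum(likelihoodB2 .* prior)

# Bayes' rule applied to each door at once
posterior = likelihoodB2 .* prior ./ probB2

println("P(B_2) = ", probB2)
println("P(A_i | B_2) = ", posterior)
```

The first entry of posterior is the success probability of staying with door 1, and the third entry is the success probability of switching to door 3.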


Now that we have shown this analytically, we perform a Monte Carlo simulation of the Monty
Hall problem in Listing 2.13.

Listing 2.13: The Monty Hall problem

using Random
Random.seed!(1)

function montyHall(switchPolicy)
    prize, choice = rand(1:3), rand(1:3)
    if prize == choice
        revealed = rand(setdiff(1:3,choice))
    else
        revealed = rand(setdiff(1:3,[prize,choice]))
    end

    if switchPolicy
        choice = setdiff(1:3,[revealed,choice])[1]
    end
    return choice == prize
end

N = 10^6
println("Success probability with policy I (stay): ",
    sum([montyHall(false) for _ in 1:N])/N)
println("Success probability with policy II (switch): ",
    sum([montyHall(true) for _ in 1:N])/N)





Success probability with policy I (stay): 0.332913 
Success probability with policy II (switch): 0.667027 





In lines 4-16 the function montyHall() is defined, which performs one simulation run of the problem
given a policy, with false indicating policy I and true indicating policy II (switching). At the start
of the game, the location of the prize and the player's door choice are uniformly and randomly
initialized. Lines 6-10 contain the logic and action of the GSH. Since he knows the location of both the
prize and the chosen door, he first mentally checks if they are the same. If they are, he reveals a door
according to line 7. If not, then he proceeds to reveal a door according to the logic in line 9. In either
case, the revealed door is stored in the variable revealed. Line 7 represents his action if the initial
choice door is the same as the prize door. In this case, he is free to reveal either of the remaining
two doors, i.e. the set difference between all doors and the player's choice door. In this case the
set difference has 2 elements. Line 9 represents the GSH action if the choice door is different to
the prize door. In this case, his hand is forced. As he cannot reveal the player's chosen door or the
prize door, he is forced to reveal the one remaining door, which can be thought of as the set difference
between 1:3 (all doors) and [prize, choice]. In this case the set difference has a single element.
Line 13 represents the contestant's action, after the GSH revelation, based on either a switch (true)
or stay (false) policy. If the contestant chooses to stay with her initial guess (false), then we skip
to line 15. However, if she chooses to switch (true), then we reassign her initial choice to the one
remaining door in line 13. Note the use of [1], which is used to assign the value from the array to
choice, rather than the array itself. Line 15 checks if the player's choice is the same as the prize,
and returns true if she wins, or false if she loses. Lines 19-22 repeat this experiment N times for
each of the policies and print the Monte Carlo estimates.





Chapter 3 


Probability Distributions - DRAFT 


In this chapter, we introduce random variables, different types of distributions and related
concepts. In the previous chapter we explored probability spaces without much emphasis on numerical
random values. However, when carrying out random experiments, there are almost always
numerical values involved. In the context of probability, these values are often called random variables.
Mathematically, a random variable X is a function of the sample space, $\Omega$, and takes on integer,
real, complex, or even a vector of values. That is, for every possible outcome $\omega \in \Omega$, there is some
possible outcome, $X(\omega)$.


The chapter is organized as follows: In Section 3.1 we introduce the concept of a random variable
and its probability distribution. In Section 3.2 we introduce the mean, variance and other numerical
descriptors of probability distributions. In Section 3.3 we explore several alternative functions for
describing probability distributions. In Section 3.4 we focus on Julia’s Distributions package
which is useful when working with probability distributions. Then Section 3.5 explores a variety of
discrete distributions. This is followed by Section 3.6 where we explore some continuous distributions
together with additional concepts such as hazard rates and more. We close with Section 3.7 where
we explore multi-dimensional probability distributions.


3.1 Random Variables 


As an example, consider a sample space $\Omega$ which consists of 6 names. Assume that the probability
function (or probability measure), P(·), assigns uniform probabilities to each of the names. Let now,
$X : \Omega \to \mathbb{Z}$, be the function (i.e. random variable) that counts the number of letters in each name.
The question is then finding:

$$
p(x) := P(X = x), \quad \text{for } x \in \mathbb{Z}.
$$

The function p(x) represents the probability distribution of the random variable X. In this case,
since X measures name lengths, X is a discrete random variable, and its probability distribution
may be represented by a Probability Mass Function (PMF), such as p(x).


To illustrate this, we carry out a simulation of many such random experiments, yielding many
replications of the random variable X, which we then use to estimate p(x). This is performed in
Listing 3.1 below.

Figure 3.1: A discrete probability distribution taking values on {3, 4, 5, 6, 8}.


Listing 3.1: A simple random variable


using StatsBase, Plots; pyplot()

names = ["Mary","Mel","David","John","Kayley","Anderson"]
randomName() = rand(names)
X = 3:8
N = 10^6
sampleLengths = [length(randomName()) for _ in 1:N]

bar(X, counts(sampleLengths)/N, ylims=(0,0.35),
    xlabel="Name length", ylabel="Estimated p(x)", legend=:none)











In line 3 we create the array names, which contains names with different character lengths. Note that
two names have four characters, namely “Mary” and “John”, while there is no name with 7 characters.
In line 4, we define the function randomName() which randomly selects, with equal probability, an
element from the array names. In line 5, we specify that we will count names of 3 to 8 characters
in length. Line 6 specifies how many random experiments of choosing a name we will perform. Line
7 uses a comprehension and the function length() to count the length of each random name, and
stores the results in the array sampleLengths. Here the Julia function length() is the analog
of the random variable. That is, it is a function of the sample space, $\Omega$, yielding a numerical value.
Line 9 uses the function counts() to count how many words are of length 3, 4, up to 8. The bar()
function is then used to plot a bar-chart of the proportion of counts for each word length. Two key
observations can be made. It can be seen that words of length 4 occurred twice as often as words of
lengths 3, 5, 6 and 8. In addition, no words of length 7 were selected, as no name in our original array
had a length of 7.
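
Because the sample space has only six equally likely names, the PMF that the simulation estimates can also be computed exactly. The sketch below does this with a dictionary of rationals; the names nameList and pmf are hypothetical and used only for this illustration.

```julia
nameList = ["Mary","Mel","David","John","Kayley","Anderson"]

# Exact PMF of the name length: each name contributes probability 1/6
# to the entry for its length
pmf = Dict{Int,Rational{Int}}()
for name in nameList
    len = length(name)
    pmf[len] = get(pmf, len, 0//1) + 1//length(nameList)
end

for len in sort(collect(keys(pmf)))
    println("p(", len, ") = ", pmf[len])
end
```

The entry for length 4 receives two contributions, which is why it is twice as large as the others, and there is no entry for length 7.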

Figure 3.2: Three different examples of probability distributions.


Types of Random Variables 


In the previous example, the random variable X took on discrete values and is thus called a
discrete random variable. However, quantities measured in nature are often continuous, in which
case a continuous random variable better describes the situation. An example is the weight of a
person randomly selected from a large population.


In describing the probability distribution of a continuous random variable, the probability mass
function, p(x), is no longer applicable. This is because for a continuous random variable X, P(X =
x) for any particular value of x is 0. Hence in this case, the Probability Density Function (PDF),
f(x) is used, where,

$$
f(x)\,\Delta \approx P(x \le X \le x + \Delta).
$$

Here the approximation becomes exact as $\Delta \to 0$. Figure 3.2 illustrates three examples of
probability distributions. The one on the left is discrete and the other two are continuous.


The discrete probability distribution appearing on the left in Figure 3.2 can be represented
mathematically by the probability mass function

$$
p(x) = \begin{cases} 0.25 & \text{for } x = 0, \\ 0.25 & \text{for } x = 1, \\ 0.5 & \text{for } x = 2. \end{cases} \qquad (3.1)
$$

The smooth continuous probability distribution is defined by the probability density function,

$$
f_1(x) = \frac{3}{4}(1 - x^2) \quad \text{for } -1 \le x \le 1.
$$

Finally, the triangular probability distribution is defined by the probability density function,

$$
f_2(x) = \begin{cases} x + 1 & \text{for } x \in [-1, 0], \\ 1 - x & \text{for } x \in (0, 1]. \end{cases}
$$


Note that for both the probability mass function and the probability density function, it is implicitly 
assumed that p(x) and f(x) are zero for x values not specified in the equation. 

It can be verified that for the discrete distribution,

$$
\sum_{x} p(x) = 1,
$$

and for the continuous distributions,

$$
\int_{-\infty}^{\infty} f_i(x)\,dx = 1 \quad \text{for } i = 1, 2.
$$

There are additional descriptors of probability distributions other than the PMF and PDF, and
these are further discussed in Section 3.3. Note that Figure 3.2 was generated by Listing 3.2 below.


Listing 3.2: Plotting discrete and continuous distributions

using Plots, Measures; pyplot()

pDiscrete = [0.25, 0.25, 0.5]
xGridD = 0:2

pContinuous(x) = 3/4*(1 - x^2)
xGridC = -1:0.01:1

pContinuous2(x) = x < 0 ? x+1 : 1-x

p1 = plot(xGridD, line=:stem, pDiscrete, marker=:circle, c=:blue, ms=6, msw=0)
p2 = plot(xGridC, pContinuous.(xGridC), c=:blue)
p3 = plot(xGridC, pContinuous2.(xGridC), c=:blue)

plot(p1, p2, p3, layout=(1,3), legend=false, ylims=(0,1.1), xlabel="x",
    ylabel=["Probability" "Density" "Density"], size=(1200, 400), margin=5mm)





In line 3 we define an array specifying the PMF of our discrete distribution, and in lines 6 and 9 we
define functions specifying the PDFs of our continuous distributions. In lines 11-16 we create plots of
each of our distributions. Note that in the discrete case we use the line=:stem argument together
with marker=:circle.





3.2 Moment Based Descriptors 


The probability distribution of a random variable fully describes the probabilities of the events,
$\{\omega \in \Omega : X(\omega) \in A\}$, for all sensible $A \subset \mathbb{R}$. However, it is often useful to describe the nature of
a random variable via a single number or a few numbers. The most common example of this is the
mean which describes the center of mass of the probability distribution. Other examples include
the variance and moments of the probability distribution. We expand on these now.


Mean 


The mean, also known as the expected value of a random variable X, is a measure of the central
tendency of the distribution of X. It is represented by E[X], and is the value we expect to obtain
"on average" if we continue to take observations of X and average out the results. The mean of a
discrete distribution with PMF p(x) is

$$
E[X] = \sum_{x} x\,p(x).
$$

In the example of the discrete distribution given by (3.1) it is,

$$
E[X] = 0 \times 0.25 + 1 \times 0.25 + 2 \times 0.5 = 1.25.
$$














The mean of a continuous random variable, with PDF f(x) is

$$
E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx,
$$

which in the examples of $f_1(\cdot)$ and $f_2(\cdot)$ from Section 3.1 yield,

$$
\int_{-1}^{1} x\,\frac{3}{4}(1 - x^2)\,dx = 0,
$$

and,

$$
\int_{-1}^{0} x\,(x + 1)\,dx + \int_{0}^{1} x\,(1 - x)\,dx = 0,
$$

respectively. As can be seen, both continuous distributions have the same mean even though their
shapes are different. For illustration purposes, we now carry out this integration numerically in
Listing 3.3.


Listing 3.3: Expectation via numerical integration


using QuadGK

sup = (-1,1)
f1(x) = 3/4*(1-x^2)
f2(x) = x < 0 ? x+1 : 1-x

expect(f,support) = quadgk((x) -> x*f(x),support...)[1]

println("Mean 1: ", expect(f1,sup))
println("Mean 2: ", expect(f2,sup))





Mean 1: 0.0 
Mean 2: -2.0816681711721685e-17 





In line 1 we specify usage of the QuadGK package, which contains functions that support
one-dimensional numerical integration via a method called adaptive Gauss-Kronrod quadrature. In lines 4
and 5 we define the PDFs of the distributions via f1() and f2(). In line 7 we define the function
expect() which takes two arguments, a function to integrate f, and a domain over which to
integrate the function, support. It uses the quadgk() function to evaluate the 1-dimensional integral
given above. For this an anonymous function (x) -> x*f(x) is created. Note that the start and
end points of the integral are support[1] and support[2] respectively. These are “splatted” into
the second and third argument of quadgk() via the ‘...’ operator. Note also that the function
quadgk() returns two values, the evaluated integral and an estimated upper bound on the
absolute error. Hence [1] is included at the end of the function, so that only the integral is returned.
Lines 9-10 then evaluate the numerical integrals of the functions f1 and f2 over the interval sup and
display the output. As can be seen, both integrals are effectively evaluated to zero.

General Expectation and Moments


In general, for a function $h : \mathbb{R} \to \mathbb{R}$ and a random variable X, we can consider the random
variable Y := h(X). The distribution of Y will typically be different from the distribution of X.
As for the mean of Y, we have,

$$
E[Y] = E[h(X)] = \begin{cases} \displaystyle\sum_{x} h(x)\,p(x) & \text{for discrete}, \\[2ex] \displaystyle\int_{-\infty}^{\infty} h(x)\,f(x)\,dx & \text{for continuous}. \end{cases} \qquad (3.2)
$$

Note that the above expression does not require explicit knowledge of the distribution of Y but
rather uses the distribution (PMF or PDF) of X.











A common case is $h(x) = x^\ell$, in which case we call $E[X^\ell]$ the $\ell$'th moment of X. Then, for a
random variable X with PDF f(x), the $\ell$'th moment of X is,

$$
E[X^\ell] = \int_{-\infty}^{\infty} x^\ell f(x)\,dx.
$$

Note that the first moment is the mean and the zero'th moment is always 1. The second moment
is related to the variance as we explain below.


Variance 


The variance of a random variable X, often denoted Var(X) or $\sigma^2$, is a measure of the spread,
or dispersion, of the distribution of X. It is defined by,

$$
\mathrm{Var}(X) := \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2.
\tag{3.3}
$$














Here we apply (3.2) by considering $h(x) = (x - \mathbb{E}[X])^2$. The second expression of (3.3)
illustrates the role of the first and second moments in the variance. It follows from the first
expression by expanding the square and using the linearity of expectation.

For the discrete distribution (3.1), we have:

$$
\mathrm{Var}(X) = (0 - 1.25)^2 \times 0.25 + (1 - 1.25)^2 \times 0.25 + (2 - 1.25)^2 \times 0.5 = 0.6875.
$$
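The hand calculation above can be checked with a few lines of Julia. This is a minimal sketch: the names vals, probs and expectation are our own, and the support and probabilities are those appearing in the calculation. It evaluates both expressions of (3.3) by applying (3.2) in the discrete case.

```julia
# Support and PMF of the discrete distribution (3.1); names are illustrative
vals  = [0, 1, 2]
probs = [0.25, 0.25, 0.5]

# Expectation of h(X) for a discrete random variable, as in (3.2)
expectation(h) = sum(h(x)*p for (x, p) in zip(vals, probs))

mu   = expectation(x -> x)             # first moment (the mean)
var1 = expectation(x -> (x - mu)^2)    # first expression in (3.3)
var2 = expectation(x -> x^2) - mu^2    # second expression in (3.3)

println("Mean: ", mu, "  Variance: ", var1, "  ", var2)
```

Both variance expressions should agree with the value 0.6875 obtained by hand.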


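For the continuous distributions from Section 3.1, $f_1(\cdot)$ and $f_2(\cdot)$, with respective
random variables $X_1$ and $X_2$, we have

$$
\mathrm{Var}(X_1) = \int_{-1}^{1} x^2\, \tfrac{3}{4}(1 - x^2)\, dx - (\mathbb{E}[X_1])^2
= \tfrac{3}{4}\Big[\tfrac{x^3}{3} - \tfrac{x^5}{5}\Big]_{-1}^{1} - 0 = 0.2,
$$

$$
\mathrm{Var}(X_2) = \int_{-1}^{0} x^2 (x + 1)\, dx + \int_{0}^{1} x^2 (1 - x)\, dx - (\mathbb{E}[X_2])^2
= \tfrac{1}{12} + \tfrac{1}{12} - 0 = \tfrac{1}{6}.
$$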
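These two variances can also be checked numerically. The sketch below assumes that the densities of Section 3.1 are $f_1(x) = \tfrac{3}{4}(1 - x^2)$ and the triangular density $f_2$, both on $[-1,1]$, and it swaps in a simple midpoint rule (our own helper, integrate) rather than the QuadGK package used earlier.

```julia
# Midpoint-rule approximation of the integral of g over [a, b]; a rough sketch
function integrate(g, a, b; n=10^5)
    dx = (b - a)/n
    return sum(g(a + (k - 0.5)*dx) for k in 1:n)*dx
end

# Densities assumed from Section 3.1, both supported on [-1, 1]
f1(x) = 3/4*(1 - x^2)
f2(x) = x < 0 ? x + 1 : 1 - x

# Second moment minus the squared mean, as in (3.3)
variance(f) = integrate(x -> x^2*f(x), -1, 1) - integrate(x -> x*f(x), -1, 1)^2

println("Var(X1) ≈ ", variance(f1), "  Var(X2) ≈ ", variance(f2))
```

The printed values should be close to 0.2 and 1/6 respectively.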


The variance of X can also be considered as the expectation of a new random variable,
$Y := (X - \mathbb{E}[X])^2$. However, when considering variance, the distribution of Y is seldom
mentioned.

Figure 3.3: Histograms for samples of the random variables X and Y.


Nevertheless, as an exercise we explore this now. Consider a random variable X, with density,

$$
f(x) =
\begin{cases}
x - 4 & \text{for } x \in [4,5], \\
6 - x & \text{for } x \in (5,6].
\end{cases}
$$

This density is similar to $f_2(\cdot)$ previously covered, but with support [4,6]. In Listing 3.4 we
generate random observations from X, and calculate data-points for Y based on these observations.
We then plot the distributions of both X and Y, and show that the sample mean of Y is approximately
the sample variance of X. Note that our code uses some elements from the Distributions package,
which is covered in more detail in Section 3.4.


Listing 3.4: Variance of X as the mean of Y

using Distributions, Plots; pyplot()

dist = TriangularDist(4,6,5)
N = 10^6
data = rand(dist,N)
yData = (data .- 5).^2

println("Mean: ", mean(yData), " Variance: ", var(data))

p1 = histogram(data, xlabel="x", bins=80, normed=true, ylims=(0,1.1))
p2 = histogram(yData, xlabel="y", bins=80, normed=true, ylims=(0,15))
plot(p1, p2, ylabel="Proportion", size=(800, 400), legend=:none)

Mean: 0.16671191478072614 Variance: 0.1667120530661165
Line 1 calls the Distributions package. This package supports a variety of distribution types
through the many functions it contains. We expand further on the use of the Distributions
package in Section 3.4. Line 3 uses the TriangularDist() function from the Distributions package
to create a triangular distribution type object with a mean of 5 and a symmetric shape over the
interval [4,6]. We assign this to the variable dist. In line 5 we generate an array of N
observations from the distribution by applying the rand() function on the distribution dist. Line 6
takes the observations in data and from them generates observations for the new random variable Y.
The values are stored in the array yData. Line 8 uses the functions mean() and var() on the arrays
yData and data respectively. It can be seen from the output that the sample mean of Y is
approximately the same as the sample variance of X. Lines 10-12 are used to plot histograms of the
data in the arrays data and yData. It can be observed that the histogram on the left approximates
the PDF of our triangular distribution, while the histogram on the right approximates the
distribution of the new variable Y. The distribution of Y is seldom considered when evaluating the
variance of X.





Higher Order Descriptors: Skewness and Kurtosis 


As described previously, the second moment plays a role in defining the dispersion of a
distribution via the variance. What about higher order moments? We now briefly define the skewness
and kurtosis of a distribution utilizing the first three moments and first four moments
respectively.

Take a random variable X with $\mathbb{E}[X] = \mu$ and $\mathrm{Var}(X) = \sigma^2$, then the
skewness is defined as,

$$
\gamma_3 = \mathbb{E}\Big[\Big(\frac{X - \mu}{\sigma}\Big)^3\Big]
= \frac{\mathbb{E}[X^3] - 3\mu\sigma^2 - \mu^3}{\sigma^3},
$$

and the kurtosis is defined as,

$$
\gamma_4 = \mathbb{E}\Big[\Big(\frac{X - \mu}{\sigma}\Big)^4\Big]
= \frac{\mathbb{E}[(X - \mu)^4]}{\sigma^4}.
$$

Note that $\gamma_3$ and $\gamma_4$ are invariant to changes in location and scale of the
distribution.


The skewness is a measure of the asymmetry of the distribution. For a distribution having a
symmetric density function about the mean, we have $\gamma_3 = 0$. Otherwise, it is either positive
or negative depending on the distribution being skewed to the right or skewed to the left
respectively.

The kurtosis is a measure of the tails of the distribution. As a benchmark, any normal probability
distribution (covered in detail in Section 3.6) has $\gamma_4 = 3$. Then, a probability distribution
with a higher value of $\gamma_4$ can be interpreted as having ‘heavier tails’ (than a normal
distribution), while a probability distribution with a lower value is said to have ‘lighter tails’
(than a normal distribution). This benchmark even yields a term called excess kurtosis, defined as
$\gamma_4 - 3$. Hence, a positive excess kurtosis implies ‘heavy tails’ and a negative value implies
‘light tails’.
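To make these descriptors concrete, here is a minimal sketch that estimates skewness and kurtosis from samples using only the built-in Statistics and Random packages. The helper shape() is our own naming. The two test cases are a uniform distribution on [0,1], which has skewness 0 and kurtosis 1.8, and an exponential distribution with rate 1 (generated here as the negative log of a uniform value), which has skewness 2 and kurtosis 9.

```julia
using Statistics, Random

# Sample skewness and kurtosis computed from standardized observations
function shape(data)
    z = (data .- mean(data)) ./ std(data, corrected=false)
    return mean(z.^3), mean(z.^4)
end

Random.seed!(0)
N = 10^6
uniformData = rand(N)                 # symmetric, light tails
expData     = -log.(1 .- rand(N))     # exponential with rate 1: skewed to the right

println("Uniform (skewness, kurtosis):     ", shape(uniformData))
println("Exponential (skewness, kurtosis): ", shape(expData))
```

The uniform sample has negative excess kurtosis (light tails), while the exponential sample has positive skewness and positive excess kurtosis (heavy tails).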


Laws of Large Numbers 


Throughout this book, our Monte-Carlo experiments rely on laws of large numbers. This suite
of mathematical statements claims that empirical averages converge to expected values. Stated as
mathematical theorems, these laws come in different forms including the weak law of large numbers
and the strong law of large numbers. In both cases, a sequence of independent and identically
distributed random variables, $X_1, X_2, \ldots$, is considered. Then for each n, we compute the
sample mean,

$$
\overline{X}_n := \frac{1}{n} \sum_{k=1}^{n} X_k,
$$

and consider the sequence of sample means,

$$
\overline{X}_1, \overline{X}_2, \ldots.
$$

If the mean of each of the random variables $X_i$ is $\mu$, then a law of large numbers is a claim
that the sequence $\{\overline{X}_n\}_{n=1}^{\infty}$ converges to $\mu$. The distinction between
“weak” and “strong” lies in the mode of convergence. For example, the weak law of large numbers
claims that the sequence of probabilities,

$$
w_n = \mathbb{P}\big(|\overline{X}_n - \mu| > \varepsilon\big),
$$

converges to 0 for any positive $\varepsilon$. That is, as n grows, the likelihood of the sample
mean $\overline{X}_n$ being farther than $\varepsilon$ from the mean $\mu$ vanishes. This is a
statement about the sequence of probabilities, $w_1, w_2, \ldots$. In contrast, the strong law of
large numbers states that,

$$
\mathbb{P}\Big(\lim_{n \to \infty} \overline{X}_n = \mu\Big) = 1.
\tag{3.4}
$$

This means that, with probability 1, the sequence of sample means converges to the expectation.
From a practical perspective the implication is similar to the weak law of large numbers, however,
mathematically the statement is different. In fact, the strong law of large numbers condition (3.4)
implies the weak law of large numbers.
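The convergence of sample means can be observed with a short simulation. The following is a minimal sketch, not tied to any listing in the book: it rolls a fair die (mean 3.5) and tracks the running sample mean. The variable names are our own.

```julia
using Random

Random.seed!(1)
N = 10^5
rolls = rand(1:6, N)                     # i.i.d. fair die rolls, each with mean 3.5
runningMeans = cumsum(rolls) ./ (1:N)    # sample means for n = 1,...,N

for n in (10, 10^3, 10^5)
    println("n = ", n, "\t sample mean = ", runningMeans[n])
end
```

As n grows, the printed sample means should settle near 3.5.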


It turns out that proving the weak law of large numbers is much easier than proving the strong
law of large numbers. Also, for the strong law of large numbers, if we are willing to assume that
$\mathbb{E}[X_i^4] < \infty$ then a proof isn't too difficult. However, the minimal conditions are
that $\mathbb{E}[X_i]$ is finite, and under these conditions a proof is more involved. Texts on
rigorous probability theory give an introduction to such aspects, including proofs. Also related is
the example presented later in Listing 3.30. It deals with the Cauchy distribution and illustrates
a scenario where the law of large numbers breaks down because $\mathbb{E}[X_i]$ does not exist.






































Keep in mind that in many cases, we convert the sequence $X_1, X_2, \ldots$ into the sequence
$I_1, I_2, \ldots$ via,

$$
I_i :=
\begin{cases}
1 & \text{if } X_i \text{ satisfies some condition}, \\
0 & \text{if } X_i \text{ does not satisfy the condition}.
\end{cases}
$$

In such a case,

$$
\mathbb{E}[I_i] = \mathbb{P}(X_i \text{ satisfies the condition}),
$$

and the average,

$$
\overline{I}_n := \frac{1}{n} \sum_{i=1}^{n} I_i,
$$

is the proportion of samples over 1,...,n that satisfy the condition. Here laws of large
numbers (weak or strong) imply that empirical proportions converge to probabilities.
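As an illustration of this indicator construction, the sketch below estimates a probability as an empirical proportion. The example is our own: each $X_i$ is the sum of two independent uniform(0,1) values and the condition is $X_i > 1.5$, for which the exact probability is the area of a triangle with legs of length 0.5, namely 0.125.

```julia
using Statistics, Random

Random.seed!(1)
N = 10^6
x = rand(N) .+ rand(N)        # X_i: sum of two independent uniform(0,1) values
indicators = x .> 1.5         # I_i = 1 if X_i satisfies the condition, 0 otherwise
estimate = mean(indicators)   # proportion of samples satisfying the condition

println("Estimated probability: ", estimate, "  (exact value: 0.125)")
```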
3.3 Functions Describing Distributions


As alluded to in Section 3.1, a probability distribution can be described by a probability mass
function (PMF) in the discrete case, or a probability density function (PDF) in the continuous case.
However, there are other popular descriptors of probability distributions, such as the cumulative
distribution function (CDF), the complementary cumulative distribution function (CCDF), and the
inverse cumulative distribution function. There are also transform-based descriptors including the
moment generating function (MGF), the probability generating function (PGF), and related functions
such as the characteristic function (CF). These also go by alternative names, including the Laplace
transform, Fourier transform or z transform. Then, for non-negative random variables there is also
the hazard function which we explore along with the Weibull distribution in Section 3.6. The main
point to take away here is that a probability distribution can be described in many alternative
ways. We now explore a few of these descriptors.


Cumulative Probabilities 


Consider first the CDF of a random variable X, defined as,

$$
F(x) := \mathbb{P}(X \le x),
$$

where X can be discrete, continuous or a more general random variable. The CDF is a very popular
descriptor because unlike the PMF or PDF, it is not restricted to just the discrete or just the
continuous case. A closely related function is the CCDF,
$\overline{F}(x) := 1 - F(x) = \mathbb{P}(X > x)$.

From the definition of the CDF, $F(\cdot)$,

$$
\lim_{x \to -\infty} F(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} F(x) = 1.
$$

Furthermore, $F(\cdot)$ is a non-decreasing (and right-continuous) function. In fact, any function
with these properties constitutes a valid CDF and hence a probability distribution of a random
variable.


In the case of a continuous random variable, the PDF $f(\cdot)$ and the CDF $F(\cdot)$ are related
via,

$$
f(x) = \frac{d}{dx} F(x) \quad \text{and} \quad F(x) = \int_{-\infty}^{x} f(u)\, du.
$$

Also, as a consequence of the CDF properties,

$$
f(x) \ge 0, \quad \text{and} \quad \int_{-\infty}^{\infty} f(x)\, dx = 1.
\tag{3.5}
$$

Analogously, while less appealing than the continuous counterpart, in the case of a discrete random
variable, the PMF $p(\cdot)$ is related to the CDF via,

$$
p(x) = F(x) - \lim_{t \to x^-} F(t) \quad \text{and} \quad F(x) = \sum_{k \le x} p(k).
\tag{3.6}
$$

Note that here we consider p(x) to be 0 for x not in the support of the random variable. The
important point in presenting (3.5) and (3.6) is to show that $F(\cdot)$ is a valid description of
the probability distribution.
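The discrete relationship (3.6) is easy to see in code. In this minimal sketch (variable names are our own, and the PMF is that of the discrete distribution used earlier with support {0, 1, 2}), the CDF at the support points is a cumulative sum of the PMF and the PMF is recovered as the jumps of the CDF.

```julia
# PMF of a discrete distribution on the support {0, 1, 2}
vals = [0, 1, 2]
p = [0.25, 0.25, 0.5]

# CDF at the support points: F(x) is the sum of p(k) over k <= x
F = cumsum(p)

# Recover the PMF as the jumps of the CDF: p(x) = F(x) - F(x-)
pRecovered = diff(vcat(0.0, F))

println("CDF at support points: ", F)
println("Recovered PMF:         ", pRecovered)
```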
Figure 3.4: The CDF associated with the PDF $f_2(x)$.

In Listing 3.5 we look at an elementary example, where we consider the PDF $f_2(\cdot)$ of
Section 3.1 and integrate it via a crude Riemann sum to obtain the CDF:

$$
F(x) = \mathbb{P}(X \le x) = \int_{-\infty}^{x} f_2(u)\, du \approx \sum_{u = -\infty}^{x} f_2(u)\, \Delta u.
\tag{3.7}
$$


Listing 3.5: CDF from the Riemann sum of a PDF

using Plots, LaTeXStrings; pyplot()

f2(x) = (x<0 ? x+1 : 1-x)*(abs(x)<1 ? 1 : 0)
a, b = -1.5, 1.5
delta = 0.01

F(x) = sum([f2(u)*delta for u in a:delta:x])

xGrid = a:delta:b
y = [F(u) for u in xGrid]
plot(xGrid, y, c=:blue, xlims=(a,b), ylims=(0,1),
    xlabel=L"x", ylabel=L"F(x)", legend=:none)








In line 3 we define the function f2(). The second set of brackets in the expression ensures that
the PDF is zero outside of the region [-1,1]: it acts like an indicator function, evaluating to 1
inside the region and to 0 everywhere else. In lines 4 and 5 we set the limits of our integral, and
the step size delta. In line 7 we create a function that approximates the value of the CDF through
a crude Riemann sum by evaluating the PDF at each point u on the grid from a up to the specified
value x, and multiplying each value by delta. The total area is then approximated via the sum()
function. See (3.7). In line 9 we specify the grid of values over which we will plot our
approximated CDF. Line 10 uses the function F() to create the array y, which contains the actual
approximation of the CDF over the grid of values specified. Lines 11-12 plot Figure 3.4.
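How crude is the Riemann sum? The sketch below compares it with a closed-form CDF obtained by integrating $f_2$ piecewise. The closed form Fexact() is our own derivation, and the values of a and delta are assumed to match those of the listing.

```julia
# PDF f2 and crude Riemann-sum CDF, along the lines of Listing 3.5
f2(x) = (x < 0 ? x + 1 : 1 - x)*(abs(x) < 1 ? 1 : 0)
a, delta = -1.5, 0.01
F(x) = sum([f2(u)*delta for u in a:delta:x])

# Closed-form CDF obtained by integrating f2 piecewise (our own derivation)
function Fexact(x)
    x <= -1 && return 0.0
    x <= 0  && return (x + 1)^2/2
    x <= 1  && return 1 - (1 - x)^2/2
    return 1.0
end

xGrid = a:delta:1.5
maxErr = maximum(abs(F(x) - Fexact(x)) for x in xGrid)
println("Largest absolute error on the grid: ", maxErr)
```

The error is of the order of delta, so halving the step size should roughly halve it.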
Inverse and Quantiles


Where the CDF answers the question “what is the probability of being less than or equal to x”,
a dual question often asked is “for a given probability u, what value of x is such that the random
variable is less than or equal to x with probability u”. Mathematically, we are looking for the
inverse function of F(x). In cases where the CDF is continuous and strictly increasing over all
values, the inverse, $F^{-1}(\cdot)$, is well defined, and can be found via the equation,

$$
F\big(F^{-1}(u)\big) = u, \quad \text{for } u \in [0,1].
\tag{3.8}
$$

For example, take the sigmoid function as the CDF, which is a type of logistic function,

$$
F(x) = \frac{1}{1 + e^{-x}}.
$$

Solving for $F^{-1}(u)$ in (3.8) yields,

$$
F^{-1}(u) = \log \frac{u}{1 - u}.
$$

Observe that as $u \to 0^+$ we get $F^{-1}(u) \to -\infty$ and as $u \to 1^-$ we get
$F^{-1}(u) \to \infty$. This is the inverse CDF for the distribution. Schematically, given a
specified probability u, it allows us to find the value x such that,

$$
\mathbb{P}(X \le x) = u.
\tag{3.9}
$$

The value x satisfying (3.9) is also called the u'th quantile of the distribution. If u is given as
a percent, then it is called a percentile. The median is another related term, and is also known as
the 0.5'th quantile. Other related terms are the quartiles, with the first quartile at u = 0.25, the
third quartile at u = 0.75, and the inter-quartile range, which is defined as
$F^{-1}(0.75) - F^{-1}(0.25)$. These same terms are used again later with respect to summarizing
datasets.
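For the logistic example, the quantile-related terms above can be evaluated directly. This is a minimal sketch with our own function names F and invF, implementing the sigmoid CDF and its inverse.

```julia
# Logistic (sigmoid) CDF and its inverse
F(x) = 1/(1 + exp(-x))
invF(u) = log(u/(1 - u))

med = invF(0.5)                       # the median (0.5'th quantile)
q1, q3 = invF(0.25), invF(0.75)       # first and third quartiles

println("Median: ", med)
println("Quartiles: ", q1, ", ", q3, "   Inter-quartile range: ", q3 - q1)
println("F(invF(0.9)) = ", F(invF(0.9)))
```

By symmetry the median is 0 and the quartiles are at minus and plus log(3), so the inter-quartile range is 2 log(3).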


In more general cases, where the CDF is not necessarily strictly increasing and continuous, we
may still define the inverse CDF via,

$$
F^{-1}(u) := \inf\{x : F(x) \ge u\}.
$$

As an example of such a case, consider an arbitrary customer arriving to a queue where the server
is utilized 80% of the time, and an average service takes 1 minute. How long does such a customer
wait in the queue until service starts? Some customers won't wait at all (20% of the customers),
whereas others will need to wait until those that arrived before them are serviced. Results from
the field of queueing theory (some of which are partially touched on in Chapter 10) give rise to the
following distribution function for the waiting time:

$$
F(x) = 1 - 0.8\, e^{-(1 - 0.8)x} \quad \text{for } x \ge 0.
\tag{3.10}
$$

Notice that at x = 0, F(0) = 0.2, indicating the fact that there is a 0.2 chance for zero wait.
Such a distribution is an example of a mixed discrete and continuous distribution. Notice that this
distribution function only holds for a specific set of assumptions known as the stationary stable
M/M/1 queue, explored further in Section 10.3. We now plot both F(x) and $F^{-1}(u)$ in Listing 3.6,
where we construct $F^{-1}(\cdot)$ programmatically. Observe Figure 3.5, where the CDF F(x) exhibits
a jump at 0 indicating the “probability mass”. The inverse CDF then evaluates to 0 for all values of
$u \in [0, 0.2]$.

Figure 3.5: The CDF F(x), and its inverse $F^{-1}(u)$.


Listing 3.6: The inverse CDF

using Plots, LaTeXStrings; pyplot()

xGrid = 0:0.01:10
uGrid = 0:0.01:1
busy = 0.8

F(t) = t<0 ? 0 : 1 - busy*exp(-(1-busy)t)

infimum(B) = isempty(B) ? Inf : minimum(B)
invF(u) = infimum(filter((x) -> (F(x) >= u),xGrid))

p1 = plot(xGrid,F.(xGrid), c=:blue, xlims=(-0.1,10), ylims=(0,1),
    xlabel=L"x", ylabel=L"F(x)")

p2 = plot(uGrid,invF.(uGrid), c=:blue, xlims=(0,0.95), ylims=(0,maximum(xGrid)),
    xlabel=L"u", ylabel=L"F^{-1}(u)")

plot(p1, p2, legend=:none, size=(800, 400))











Line 3 defines the grid over which we will evaluate the CDF. Line 4 defines the grid over which we
will evaluate the inverse CDF. In line 5 we define the time proportion during which the server is
busy. In line 7 we define the function F() as in (3.10). Note that for values less than zero, the
CDF evaluates to 0. In line 9 we define the function infimum(), which implements similar logic to
the mathematical operation inf{}. It takes an input and checks if it is empty via the isempty()
function, and if it is, returns Inf, else returns the minimum value of the input. This agrees with
the typical mathematical notation where the infimum of the empty set is $\infty$. In line 10 we
define the function invF(). It first creates an array (representing a set) $\{x : F(x) \ge u\}$
directly via the Julia filter() function. Note that as a first argument, we use an anonymous Julia
function, (x) -> (F(x) >= u). We then use this function as a filter over xGrid. Finally we apply
the infimum over this mathematical set (represented by a vector of coordinates on the x axis).
Lines 12-18 are used to plot both the original CDF via the F() function, and the inverse CDF via
the invF() function.
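For this particular distribution the generalized inverse can also be written in closed form, which gives a way to sanity check the grid-based construction. In the sketch below, invFexact() is our own derivation (solve F(x) = u for u above the jump, and return 0 otherwise), while invFgrid() follows the filter-based approach described above.

```julia
busy = 0.8
F(t) = t < 0 ? 0.0 : 1 - busy*exp(-(1 - busy)*t)

# Closed-form generalized inverse for u in [0, 1): 0 on the jump, otherwise solve F(x) = u
invFexact(u) = max(0.0, -log((1 - u)/busy)/(1 - busy))

# Grid-based infimum over {x : F(x) >= u}
xGrid = 0:0.01:10
infimum(B) = isempty(B) ? Inf : minimum(B)
invFgrid(u) = infimum(filter(x -> F(x) >= u, xGrid))

for u in (0.1, 0.5, 0.8)
    println("u = ", u, "\t closed form: ", invFexact(u), "\t grid: ", invFgrid(u))
end
```

The two versions should agree up to the grid resolution of 0.01, with both returning 0 for values of u below the jump.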
Integral Transforms


In general terms, an integral transform of a probability distribution is a representation of the 
distribution on a different domain. Here we focus on the moment generating function (MGF). 
Other examples include the characteristic function (CF), probability generating function (PGF) 
and similar transforms. 














For a random variable X and a real or complex fixed value s, consider the expectation,
$\mathbb{E}[e^{sX}]$. When viewed as a function of s, this is the moment generating function. We
present this here for a continuous random variable with PDF $f(\cdot)$:

$$
M(s) = \mathbb{E}[e^{sX}] = \int_{-\infty}^{\infty} f(x)\, e^{sx}\, dx.
\tag{3.11}
$$

This is also known as the bi-lateral Laplace transform of the PDF (with argument $-s$). Many useful
properties carry over from the theory of Laplace transforms to the MGF. A full exposition of such
properties is beyond the scope of this book, however we illustrate a few via an example.


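Consider two distributions with densities,

$$
f_1(x) = 2x \quad \text{for } x \in [0,1],
$$
$$
f_2(x) = 2 - 2x \quad \text{for } x \in [0,1],
$$

where the respective random variables are denoted $X_1$ and $X_2$. Computing the MGFs of these
distributions we obtain,

$$
M_1(s) = \int_{0}^{1} 2x\, e^{sx}\, dx = 2\, \frac{1 + e^{s}(s - 1)}{s^2},
$$
$$
M_2(s) = \int_{0}^{1} (2 - 2x)\, e^{sx}\, dx = 2\, \frac{e^{s} - 1 - s}{s^2}.
$$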


Define now a random variable, $Z = X_1 + X_2$, where $X_1$ and $X_2$ are assumed independent. In
this case, it is known that the MGF of Z is the product of the MGFs of $X_1$ and $X_2$. That is,

$$
M_Z(s) = M_1(s)\, M_2(s) = 4\, \frac{\big(1 + (s - 1)e^{s}\big)\big(e^{s} - 1 - s\big)}{s^4}.
\tag{3.12}
$$

The new MGF $M_Z(\cdot)$ fully specifies the distribution of Z. It also yields a rather
straightforward computation of moments, hence the name MGF. A key property of any MGF M(s) of a
random variable X is that,

$$
\frac{d^n}{ds^n} M(s) \Big|_{s=0} = \mathbb{E}[X^n].
\tag{3.13}
$$

This can be easily verified from (3.11) by differentiating n times under the integral and setting
s = 0. Hence to calculate the n'th moment, one can simply evaluate the n'th derivative of the MGF
at s = 0. Note that in certain cases, evaluating the limit as $s \to 0$ is required.
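Property (3.13) can be checked numerically for the MGF in (3.12). The following is a rough sketch using central finite differences around s = 0 (the step size h is an arbitrary choice of ours). Since the closed-form expression is of the form 0/0 at s = 0, we use the fact that M(0) = 1 for any MGF rather than evaluating it there. For reference, the densities above give $\mathbb{E}[Z] = 2/3 + 1/3 = 1$ and $\mathrm{Var}(Z) = 1/18 + 1/18 = 1/9$.

```julia
# MGF of Z = X1 + X2 as in (3.12); the expression is 0/0 at s = 0 in floating point
mgf(s) = 4*(1 + (s - 1)*exp(s))*(exp(s) - 1 - s)/s^4

h = 0.01    # step size for the finite differences (an arbitrary choice)

# Central differences around s = 0, using M(0) = 1
firstMoment  = (mgf(h) - mgf(-h))/(2h)
secondMoment = (mgf(h) - 2*1.0 + mgf(-h))/h^2

println("E[Z]   ≈ ", firstMoment)
println("E[Z^2] ≈ ", secondMoment)
println("Var(Z) ≈ ", secondMoment - firstMoment^2)
```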


In Listing 3.7 we estimate both the PDF and MGF of Z and compare the estimated MGF to
$M_Z(s)$ above. The listing also creates Figure 3.6, where on the right hand side plot it can be
seen that the slope of the tangent line to the MGF at s = 0 is 1.0, in agreement with the mean.

Figure 3.6: Left: The estimate of the PDF of Z via a histogram.
Right: The theoretical MGF in red vs. a Monte Carlo estimate in blue.
The slope of the black line is the mean.


Listing 3.7: A sum of two triangular random variables

using Distributions, Statistics, Plots; pyplot()

dist1 = TriangularDist(0,1,1)
dist2 = TriangularDist(0,1,0)
N=10^6

data1, data2 = rand(dist1,N), rand(dist2,N)
dataSum = data1 + data2

mgf(s) = 4(1+(s-1)*MathConstants.e^s)*(MathConstants.e^s-1-s)/s^4

mgfPointEst(s) = mean([MathConstants.e^(s*z) for z in
                        rand(dist1,20) + rand(dist2,20)])

p1 = histogram(dataSum, bins=80, normed=true,
            ylims=(0,1.4), xlabel="z", ylabel="PDF")

sGrid = -1:0.01:1
p2 = plot(sGrid, mgfPointEst.(sGrid), c=:blue, ylims=(0,3.5))
p2 = plot!(sGrid, mgf.(sGrid), c=:red)
p2 = plot!([minimum(sGrid),maximum(sGrid)],
            [minimum(sGrid),maximum(sGrid)].+1,
            c=:black, xlabel="s", ylabel="MGF")

plot(p1, p2, legend=:none, size=(800, 400))
In lines 3 and 4 we create two separate triangular distribution type objects dist1 and dist2,
matching the densities $f_1(x)$ and $f_2(x)$ respectively. Note that the third argument of the
TriangularDist() function is the location of the “peak” of the triangle (or the mode of the
distribution). Distribution objects are covered further in Section 3.4 below. In line 7 we generate
random observations from dist1 and dist2, and store these observations separately in the two arrays
data1 and data2 respectively. In line 8 we generate observations for Z by performing element-wise
summation of the values in our arrays data1 and data2. In line 10 we implement the MGF function as
in (3.12). In lines 12-13 we define the function mgfPointEst(), which crudely estimates the MGF
at the point s. We purposefully only use 20 observations, each time estimating the sample mean of
$e^{sZ}$ for a specified s. The remainder of the code uses the data and the defined functions to
generate Figure 3.6. Lines 21-23 plot the black line.








3.4 Distributions and Related Packages 


As touched on previously in Listing 3.4 and Listing 3.7, Julia has a well developed package for
distributions. The Distributions package allows us to create distribution type objects based
on what family they belong to (more on families of distributions in Sections 3.5 and 3.6). These
distribution objects can then be used as arguments for other functions, for example mean() and
var(). Of key importance is the ability to randomly sample from a distribution using rand().
We can also use distributions with other functions including pdf(), cdf(), and quantile() to
name a few. In addition, the built-in Statistics package as well as the StatsBase package
contain many functions which have methods for distribution type objects. A useful paper describing
the Distributions package is [BAABLPP19].


Weighted Vectors 


In the case of discrete distributions of finite support, the StatsBase package provides the
“weight vector” object via Weights(), which allows for an array of values, or outcomes, to be given
probabilistic weights. This is also known as a probability vector. In order to generate observations
we use the sample() function (from StatsBase) on a vector given its weights, instead of the
rand() function. Note that an alternative is to use the Categorical distribution supplied via the
Distributions package. Listing 3.8 below provides a brief example of the use of weight vectors.





Listing 3.8: Sampling from a weight vector

using StatsBase, Random
Random.seed!(1)

grade = ["A","B","C","D","E"]
weightVect = Weights([0.1,0.2,0.1,0.2,0.4])

N = 10^6
data = sample(grade,weightVect,N)
[count(i->(i==g),data) for g in grade]/N

5-element Array{Float64,1}:
 0.099901
 0.200248
 0.099704
 0.20068
 0.399467





In line 4 we define an array of strings “A” to “E”, which represent possible outcomes. In line 5
we define their weights. Note that Weights() is capitalized, signifying that the function creates a
new object. This type of function is known as a constructor. Line 8 uses the function sample() to
sample N observations from our array grade, according to the weights given by the weight vector
weightVect. Line 9 uses the count() function to count how many times each entry g in grade occurs
in data, and then evaluates the proportion of times each grade occurs. It can be observed that the
grades have been sampled according to the probabilities specified in the array weightVect. Note
that you can also use the Categorical() object in the Distributions package as an alternative.
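To see what sampling according to weights involves, here is a minimal first-principles sketch using only rand(). It is not a description of how StatsBase implements sample(). The helper sampleOne() and the variable names are our own: a uniform value is drawn and the first outcome whose cumulative weight reaches it is selected.

```julia
using Random

Random.seed!(1)
grade = ["A", "B", "C", "D", "E"]
w = [0.1, 0.2, 0.1, 0.2, 0.4]
cumWeights = cumsum(w)          # CDF over the indices 1,...,5

# Draw one outcome: the first index whose cumulative weight reaches a uniform value
function sampleOne()
    i = searchsortedfirst(cumWeights, rand())
    return grade[min(i, length(grade))]   # guard against round-off in the last cumulative weight
end

N = 10^6
data = [sampleOne() for _ in 1:N]
proportions = [count(==(g), data) for g in grade]/N
println(proportions)
```

The printed proportions should be close to the specified weights.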





Using Distribution Type Objects 


We now introduce some important functionality of the Distributions package and distribution
type objects through an example. Consider a distribution from the “Triangular” family, with
the following density,

$$
f(x) =
\begin{cases}
x & \text{for } x \in [0,1], \\
2 - x & \text{for } x \in (1,2].
\end{cases}
$$

In Listing 3.9, rather than creating the density manually as in the previous sections, we use the
TriangularDist() constructor to create a distribution type object, and then use this to create
plots of the PDF, CDF and inverse CDF as in Figure 3.7.


Listing 3.9: Using the pdf(), cdf(), and quantile() functions with Distributions

using Distributions, Plots, LaTeXStrings; pyplot()

dist = TriangularDist(0,2,1)
xGrid = 0:0.01:2
uGrid = 0:0.01:1

p1 = plot( xGrid, pdf.(dist,xGrid), c=:blue,
        xlims=(0,2), ylims=(0,1.1),
        xlabel="x", ylabel="f(x)")
p2 = plot( xGrid, cdf.(dist,xGrid), c=:blue,
        xlims=(0,2), ylims=(0,1),
        xlabel="x", ylabel="F(x)")
p3 = plot( uGrid, quantile.(dist,uGrid), c=:blue,
        xlims=(0,1), ylims=(0,2),
        xlabel="u", ylabel=(L"F^{-1}(u)"))

plot(p1, p2, p3, legend=false, layout=(1,3), size=(1200, 400))

Figure 3.7: The PDF, CDF and inverse CDF of a triangular distribution.





In line 3 we use the TriangularDist() function to create a distribution type object. The first
two arguments are the start and end points of the support, and the third argument is the location of
the “peak” (or mode). The essence of this example is in lines 7, 10 and 13 where we use the pdf(),
cdf() and quantile() functions respectively. In each case we use dist as the first argument and
broadcast over the second argument via the ‘.’ broadcast operator.





In addition to evaluating functions associated with the distribution, we can also query a
distribution object for a variety of properties and parameters. Given a distribution object, you may
apply params() on it to retrieve the distributional parameters. You may query for the mean(),
median(), var() (variance), std() (standard deviation), skewness(), and kurtosis(). You
can also query for the minimal and maximal value in the support of the distribution via minimum()
and maximum() respectively. You may also apply mode() or modes() to either get a single mode
(value of x where the PMF or PDF is maximized) or an array of modes where applicable. Listing 3.10
illustrates some of these for our TriangularDist.


Listing 3.10: Descriptors of Distribution objects

using Distributions
dist = TriangularDist(0,2,1)

println("Parameters: \t\t\t",params(dist))
println("Central descriptors: \t\t",mean(dist),"\t",median(dist))
println("Dispersion descriptors: \t", var(dist),"\t",std(dist))
println("Higher moment shape descriptors: ",skewness(dist),"\t",kurtosis(dist))
println("Range: \t\t\t\t", minimum(dist),"\t",maximum(dist))
println("Mode: \t\t\t\t", mode(dist), "\tModes: ",modes(dist))

Parameters:                      (0.0, 2.0, 1.0)
Central descriptors:             1.0     1.0
Dispersion descriptors:          0.16666666666666666     0.408248290463863
Higher moment shape descriptors: 0.0     -0.6
Range:                           0.0     2.0
Mode:                            1.0     Modes: [1.0]
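The descriptors in the output can be reproduced from the density by numerical integration. The sketch below uses a simple midpoint rule (our own helper, integrate) together with the definitions of $\gamma_3$ and $\gamma_4$ given earlier. Note that the value -0.6 printed for kurtosis(dist) corresponds to the excess kurtosis $\gamma_4 - 3$, since for this triangular distribution $\gamma_4 = 2.4$.

```julia
# Midpoint-rule approximation of the integral of g over [a, b]; a rough sketch
function integrate(g, a, b; n=10^5)
    dx = (b - a)/n
    return sum(g(a + (k - 0.5)*dx) for k in 1:n)*dx
end

f(x) = x <= 1 ? x : 2 - x      # triangular density on [0,2] with peak at 1

mu     = integrate(x -> x*f(x), 0, 2)
sigma  = sqrt(integrate(x -> (x - mu)^2*f(x), 0, 2))
gamma3 = integrate(x -> ((x - mu)/sigma)^3*f(x), 0, 2)
gamma4 = integrate(x -> ((x - mu)/sigma)^4*f(x), 0, 2)

println("Mean: ", mu, "  Variance: ", sigma^2)
println("Skewness: ", gamma3, "  Kurtosis: ", gamma4, "  Excess kurtosis: ", gamma4 - 3)
```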
In Listing 3.11 we look at another example, where we generate random observations from a
distribution type object via the rand() function, and compare the sample mean against the specified
mean. Note that two different types of distributions are created here, a continuous distribution and
a discrete distribution. Discrete and continuous families are discussed further in Sections 3.5
and 3.6 respectively.


Listing 3.11: Using rand() with Distributions

using Distributions, StatsBase, Random
Random.seed!(1)

dist1 = TriangularDist(0,10,5)
dist2 = DiscreteUniform(1,5)
theorMean1, theorMean2 = mean(dist1), mean(dist2)

N = 10^6
data1 = rand(dist1,N)
data2 = rand(dist2,N)
estMean1, estMean2 = mean(data1), mean(data2)

println("Symmetric Triangular Distribution on [0,10] has mean $theorMean1
        (estimated: $estMean1)")
println("Discrete Uniform Distribution on {1,2,3,4,5} has mean $theorMean2
        (estimated: $estMean2)")

Symmetric Triangular Distribution on [0,10] has mean 5.0
        (estimated: 4.999164797766807)
Discrete Uniform Distribution on {1,2,3,4,5} has mean 3.0
        (estimated: 3.001862)











In line 4 we use the TriangularDist() function to create a symmetrical triangular distribution
about 5, and store this as dist1. In line 5 we use the DiscreteUniform() function to create a
discrete uniform distribution, and store this as dist2. Note that observations from this distribution
can take on values from {1,2,3,4,5}, each with equal probability. In line 6 we evaluate the mean
of the two distribution objects created above by applying the function mean() to both of them.
These methods of mean() only use the parameters of the distribution to evaluate the mean. No data
manipulation is taking place. In lines 8-11 we estimate the means of the two distributions by
randomly sampling from our distributions dist1 and dist2. In lines 9-10 the distribution object is
given as a first argument to rand(). Lines 13-16 print the results. It can be seen that the
estimated means are a good approximation of the actual means.





The Inverse Probability Transform 


One may ask how does Julia (or any software package) generate random values from a given 
distribution? There are a variety of techniques for transforming pseudo-random numbers from a 
uniform distribution into numbers from a given distribution. An extensive treatment is in [KTB11]. 
One basic method which stands above the rest is inverse transform sampling. 


Let X be a random variable distributed with CDF $F(\cdot)$ and inverse CDF $F^{-1}(\cdot)$. Now take
U to be a uniform random variable over [0,1], and let $Y = F^{-1}(U)$. It holds that Y is
distributed like X. This useful property is called the inverse probability transform and constitutes
a generic method for generating random variables from an underlying distribution.

Figure 3.8: A histogram generated using the inverse probability transform
compared to the PDF of a triangular distribution.


To see why the method works, consider a uniform random variable U and apply to it the inverse
probability transform $F^{-1}(\cdot)$. In such a case, consider the CDF of $Y = F^{-1}(U)$ and see
that it is

$$
F_Y(y) = \mathbb{P}(Y \le y) = \mathbb{P}\big(F^{-1}(U) \le y\big) = \mathbb{P}\big(U \le F(y)\big)
= F_U\big(F(y)\big) = F(y).
$$

The third equality follows because $F(\cdot)$ is a monotonic function and can be applied to both
sides of the inequality. The last step follows because the CDF of a uniform(0,1) random variable is,

$$
F_U(z) =
\begin{cases}
0 & \text{for } z < 0, \\
z & \text{for } 0 \le z \le 1, \\
1 & \text{for } 1 < z.
\end{cases}
$$


Keep in mind that when using the Distributions package, we would typically generate random
variables using the rand() function on a distribution type object, as performed in Listing 3.11
above. The implementation of rand() may use the inverse probability transform or alternatively
may use a different type of method depending on the distribution at hand. However, in Listing 3.12
below, we illustrate how to use the inverse probability transform with the results presented in
Figure 3.8. Observe that we can implement $F^{-1}(\cdot)$ via the quantile() function.


Listing 3.12: Inverse transform sampling

using Distributions, Plots; pyplot()

triangDist = TriangularDist(0,2,1)
xGrid = 0:0.1:2
N = 10^6
inverseSampledData = quantile.(triangDist,rand(N))

histogram( inverseSampledData, bins=30, normed=true,
            ylims=(0,1.1), label="Inverse transform data")
plot!( xGrid, pdf.(triangDist,xGrid), c=:red, lw=4,
        xlabel="x", label="PDF", ylabel = "Density", legend=:topright)
In line 3 we create our triangular distribution triangDist. In lines 4 and 5 we define the support over
which we plot our data, as well as how many data-points we simulate. In line 6 we generate N random
observations from a continuous uniform distribution over the domain [0,1] via the rand() function.
We then use the quantile() function, along with the dot operator (.), to calculate the corresponding
quantile of triangDist for each observation. Lines 8 and 9 plot a histogram of this
inverseSampledData, using 30 bins. For large N, the histogram generated is a close approximation
of the PDF of the underlying distribution. Lines 10-11 then plot the analytic PDF of the underlying
distribution.
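
As a complement to the use of quantile() above, the following minimal sketch writes out the inverse
CDF of the same triangular distribution by hand and feeds uniform random numbers through it. The
function name triangInvCDF and the seed are our own illustrative choices and are not part of the
Distributions package. It assumes the CDF is $F(x) = x^2/2$ on $[0,1]$ and $F(x) = 1 - (2-x)^2/2$ on $(1,2]$.

```julia
using Random, Statistics

# Hand-written inverse CDF of the triangular distribution on [0,2] with mode 1.
# Solving u = x^2/2 gives the first branch; solving u = 1 - (2-x)^2/2 gives the second.
triangInvCDF(u) = u <= 0.5 ? sqrt(2u) : 2 - sqrt(2*(1 - u))

Random.seed!(0)                        # arbitrary seed, only for reproducibility
samples = triangInvCDF.(rand(10^5))    # inverse transform sampling

println("Sample mean: ", mean(samples))        # should be close to the true mean of 1
println("Sample range: ", extrema(samples))    # should lie within [0, 2]
```

A histogram of samples could be compared against the PDF in the same way as in Listing 3.12.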





3.5 Families of Discrete Distributions 


A family of probability distributions is a collection of probability distributions having some
functional form that is parameterized by a well-defined set of parameters. In the discrete case, the PMF,
$p(x;\theta) = P(X = x)$, is parameterized by the parameter $\theta \in \Theta$ where $\Theta$ is called the parameter
space. The (scalar or vector) parameter $\theta$ then affects the actual form of the PMF, including
possibly the support of the random variable. Hence, technically a family of distributions is the collection
of PMFs $p(\cdot\,;\theta)$ for all $\theta \in \Theta$.


In this section we present some of the most common families of discrete distributions. We 
consider the following: discrete uniform distribution, binomial distribution, geometric distribution, 
negative binomial distribution, hypergeometric distribution and Poisson distribution. Each of these 
is implemented in the Julia Distributions package. The approach that we take in the code 
examples of this section is to generate random variables from each distribution using first principles, 
as opposed to applying rand() on a distribution object, as was demonstrated in Listing 3.11 above.
Understanding how to generate a random variable from a given distribution using first principles 
helps strengthen understanding of the associated probability models and processes. 


In Listing 3.13 we illustrate how to create a distribution object for each of the discrete
distributions that we investigate in this section. As output we print the parameters and the support of
each distribution.


Listing 3.13: Families of discrete distributions


using Distributions
dists = [
    DiscreteUniform(10,20),
    Binomial(10,0.5),
    Geometric(0.5),
    NegativeBinomial(10,0.5),
    Hypergeometric(30,40,10),
    Poisson(5.5)]

println("Distribution \t\t\t\t\t\t Parameters \t Support")
reshape([dists ; params.(dists) ;
        ((d)->(minimum(d),maximum(d))).(dists) ],
        length(dists),3)

Distribution                                  Parameters      Support
6x3 Array{Any,2}:
 DiscreteUniform(a=10, b=20)                  (10, 20)        (10, 20)
 Binomial{Float64}(n=10, p=0.5)               (10, 0.5)       (0, 10)
 Geometric{Float64}(p=0.5)                    (0.5,)          (0, Inf)
 NegativeBinomial{Float64}(r=10.0, p=0.5)     (10.0, 0.5)     (0, Inf)
 Hypergeometric(ns=30, nf=40, n=10)           (30, 40, 10)    (0, 10)
 Poisson{Float64}(λ=5.5)                      (5.5,)          (0, Inf)





Lines 2-8 are used to define an array of distribution objects. The help provided by the Distributions
package is useful. Use ? «Name» where «Name» may be DiscreteUniform, Binomial, etc.
Lines 10-13 result in output that is a 6x3 array of type Any. The first column is the actual distribution
object, the second column has the distributional parameters, and the third column represents the
support. The parameters and the support for each distribution are presented in more detail later in
this section. Note the use of an anonymous function (d)->(minimum(d),maximum(d)) applied
via '.' to each element of dists. This function returns a tuple. The use of reshape() transforms
the array of arrays into a matrix of the desired dimensions.








Discrete Uniform Distribution 


The discrete uniform distribution is simply a probability distribution that places equal
probability on each of its possible outcomes. One example is given by the outcomes of a die
toss. The probability of each possible outcome for a fair, six-sided die is given by,

$$P(X = x) = \frac{1}{6} \quad \text{for } x = 1,\ldots,6.$$


Listing 3.14 simulates N tosses of a die, and then calculates and plots the proportion of times
each possible outcome occurs, along with the PMF. The plot is in Figure 3.9. For large values of
N, the proportion of counts for each outcome converges to 1/6.


Listing 3.14: Discrete uniform die toss


using StatsBase, Plots; pyplot()

faces, N = 1:6, 10^6
mcEstimate = counts(rand(faces,N), faces)/N

plot(faces, mcEstimate,
    line=:stem, marker=:circle,
    c=:blue, ms=10, msw=0, lw=4, label="MC estimate")
plot!([i for i in faces], [1/6 for _ in faces],
    line=:stem, marker=:xcross, c=:red,
    ms=6, msw=0, lw=2, label="PMF",
    xlabel="Face number", ylabel="Probability", ylims=(0,0.22))

Figure 3.9: A discrete uniform PMF.





In line 3 we define all possible outcomes of our six-sided die, along with how many die tosses we will
simulate. Line 4 uniformly and randomly generates N observations from our die, and then uses the
counts() function to calculate the proportion of times each outcome occurs. Note that applying
rand(DiscreteUniform(1,6),N) would yield a statistically identical result to rand(faces,N).
The first plot() call uses line=:stem to create a stem plot of the proportion of times each outcome
occurs, while the plot!() call that follows plots the analytic PMF of our six-sided die.





Binomial Distribution 


The binomial distribution is a discrete distribution which arises where multiple identical and 
independent yes/no, true/false, success/failure trials (also known as Bernoulli trials) are performed. 
For each trial, there can only be two outcomes, and the probability weightings of each unique trial 
must be the same. 


As an example, consider a two-sided coin, which is flipped $n$ times in a row. If the probability
of obtaining a head in a single flip is $p$, then the probability of obtaining $x$ heads in total is given by
the PMF,

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x} \quad \text{for } x = 0,1,\ldots,n.$$


Listing 3.15 simulates $n = 10$ tosses of a fair coin (success probability $p = 1/2$), N times in total,
and calculates the proportion of times each possible outcome occurs. Observe that in
the Distributions package, pdf() applied to a discrete distribution yields the PMF. In fact, the
PMF is often loosely called a PDF (density) in statistics. The results are presented in Figure 3.10.

Figure 3.10: Binomial PMF for number of heads in 10 flips each with p = 0.5.


Listing 3.15: Coin flipping and the binomial distribution


using StatsBase, Distributions, Plots; pyplot()

binomialRV(n,p) = sum(rand(n) .< p)

p, n, N = 0.5, 10, 10^6

bDist = Binomial(n,p)
xGrid = 0:n
bPmf = [pdf(bDist,i) for i in xGrid]
data = [binomialRV(n,p) for _ in 1:N]
pmfEst = counts(data,0:n)/N

plot( xGrid, pmfEst,
    line=:stem, marker=:circle,
    c=:blue, ms=10, msw=0, lw=4, label="MC estimate")
plot!( xGrid, bPmf,
    line=:stem, marker=:xcross, c=:red,
    ms=6, msw=0, lw=2, label="PMF", xticks=(0:1:10),
    ylims=(0,0.3), xlabel="x", ylabel="Probability")




















In line 3 we define the function binomialRV(). It generates a binomial random variable from first
principles by creating an array of uniform [0,1] values of length n with rand(n). We then use .<
to compare each value (element-wise) to p. The result is a vector of booleans, with each one set to
true with probability p. Summing up this vector creates the binomial random variable. In line 9
we create a vector incorporating the values of the binomial PMF. Note that in the Julia Distributions
package, PMFs are created via pdf(). Line 10 is where we generate N random values. In line 11 we
use counts() from the StatsBase package to count how many times each outcome occurred, for
0:n heads. We then normalize via division by N. The remainder of the code creates the plot.
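
As a quick check of the PMF formula itself, the following sketch evaluates it directly with Julia's
built-in binomial() function, without the Distributions package. The function name binomialPMF is
our own and is used only for illustration.

```julia
# Binomial PMF evaluated directly from the formula C(n,x) p^x (1-p)^(n-x).
binomialPMF(n, p, x) = binomial(n, x) * p^x * (1 - p)^(n - x)

n, p = 10, 0.5
pmf = [binomialPMF(n, p, x) for x in 0:n]

println("P(X = 5) = ", pmf[6])        # entry 6 corresponds to x = 5
println("Sum of PMF = ", sum(pmf))    # a valid PMF sums to 1 (up to rounding)
```

The values in pmf should agree with those returned by pdf() on a Binomial(n,p) object.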








Note that the Binomial distribution describes part of the fishing example in Section where 
we sample with replacement. This is because the probability of success (i.e. fishing a gold fish) 
remains unchanged regardless of how many times we have sampled from the pond. 
Geometric Distribution


Another distribution associated with Bernoulli trials is the geometric distribution. In this case,
consider an infinite sequence of independent trials, each with success probability $p$, and let $X$ be
the index of the first trial that is successful. Using first principles (the first $x-1$ trials fail and
trial $x$ succeeds) it is easy to see that the PMF is,

$$P(X = x) = p(1-p)^{x-1} \quad \text{for } x = 1,2,\ldots. \qquad (3.14)$$


An alternative version of the geometric distribution is the distribution of the random variable $\tilde{X}$,
counting the number of failures until success. Observe that for every sequence of trials, $\tilde{X} = X - 1$.
From this it is easy to relate the PMFs of the random variables and see that,

$$P(\tilde{X} = x) = p(1-p)^{x} \quad \text{for } x = 0,1,2,\ldots.$$

In the Julia Distributions package, Geometric stands for the distribution of $\tilde{X}$, not $X$.


We now look at an example involving the popular casino game of roulette. Roulette is a game of
chance, where a ball is spun on the inside edge of a horizontal wheel. As the ball loses momentum,
it eventually falls vertically down, and lands on one of 37 spaces, numbered 0 to 36. There are 18
black spaces, 18 red, and a single space ('zero') is green. Each spin of the wheel is independent, and
each of the possible 37 outcomes is equally likely. Now let us assume that a gambler goes to the
casino and plays a series of roulette spins. There are various ways to bet on the outcome of roulette,
but in this case he always bets on black (if the ball lands on black he wins, otherwise he loses). Say
that the gambler plays until his first win. In this case, the number of plays is a geometric random
variable with support $x = 1,2,\ldots$. Listing 3.16 simulates this scenario and creates Figure 3.11.


Listing 3.16: The geometric distribution


using StatsBase, Distributions, Plots; pyplot()

function rouletteSpins(p)
    x = 0
    while true
        x += 1
        if rand() < p
            return x
        end
    end
end

p, xGrid, N = 18/37, 1:7, 10^6

mcEstimate = counts([rouletteSpins(p) for _ in 1:N],xGrid)/N

gDist = Geometric(p)
gPmf = [pdf(gDist,x-1) for x in xGrid]

plot(xGrid, mcEstimate, line=:stem, marker=:circle,
    c=:blue, ms=10, msw=0, lw=4, label="MC estimate")
plot!( xGrid, gPmf, line=:stem, marker=:xcross,
    c=:red, ms=6, msw=0, lw=2, label="PMF",
    ylims=(0,0.5), xlabel="x", ylabel="Probability")

Figure 3.11: A geometric PMF.
The function rouletteSpins() defined in lines 3-11 is a straightforward way to generate a geometric
random variable with support 1,2,... as $X$ above. Lines 5-10 loop until a value is returned from the
function. In each iteration, we increment x and check if we have a success (an event happening
with probability p) via rand() < p. The remainder of the code is similar to the previous listing.
Consider the second argument to pdf() in line 18. Here x-1 is used because the built-in geometric
distribution is for the random variable $\tilde{X}$ above, which starts at 0, while we are interested in the
geometric random variable starting at 1.





Negative Binomial Distribution 


Recall the previous example of a roulette gambler. Assume now that the gambler plays
until he wins for the $r$'th time (in the previous example $r = 1$). The negative binomial distribution
describes this situation. That is, a random variable $X$ follows this distribution, if it describes the
number of trials until the $r$'th success. The PMF is given by,

$$P(X = x) = \binom{x-1}{r-1} p^r (1-p)^{x-r} \quad \text{for } x = r, r+1, r+2, \ldots.$$

Notice that with $r = 1$ the expression reduces to the geometric PMF (3.14). Similarly to the
geometric case, there is an alternative version of the negative binomial distribution. Let $\tilde{X}$ denote
the number of failures until the $r$'th success. Here, like in the geometric case, when both random
variables are coupled on the same sequence of trials, we have $\tilde{X} = X - r$. As a result,

$$P(\tilde{X} = x) = \binom{x+r-1}{x} p^r (1-p)^{x} \quad \text{for } x = 0,1,2,\ldots.$$


To help reinforce this, in Listing 3.17 below we simulate a gambler who bets consistently on
black much like in the previous example, and determine the PMF for $r = 5$. That is, we determine
the probabilities that $x$ plays will occur up to the 5'th success (or win).


Figure 3.12: The PMF of negative binomial with r = 5 and p = 18/37.


Listing 3.17: The negative binomial distribution


using StatsBase, Distributions, Plots

function rouletteSpins(r,p)
    x = 0
    wins = 0
    while true
        x += 1
        if rand() < p
            wins += 1
            if wins == r
                return x
            end
        end
    end
end

r, p, N = 5, 18/37, 10^6
xGrid = r:r+15

mcEstimate = counts([rouletteSpins(r,p) for _ in 1:N],xGrid)/N

nbDist = NegativeBinomial(r,p)
nbPmf = [pdf(nbDist,x-r) for x in xGrid]

plot( xGrid, mcEstimate,
    line=:stem, marker=:circle, c=:blue,
    ms=10, msw=0, lw=4, label="MC estimate")
plot!( xGrid, nbPmf, line=:stem,
    marker=:xcross, c=:red, ms=6, msw=0, lw=2, label="PMF",
    xlims=(0,maximum(xGrid)), ylims=(0,0.2),
    xlabel="x", ylabel="Probability")






















This code is similar to the previous listing. The main difference is in the function rouletteSpins(),
which now accepts both r and p as arguments. It is a straightforward implementation of the negative
binomial story. A value is returned in line 11 only once the number of wins equals r. In a similar
manner to the geometric example, notice that in line 23 we use x-r for the argument of the pdf()
function. This is because NegativeBinomial in the Distributions package stands for a
distribution with support $x = 0,1,2,\ldots$ and not $x = r, r+1, r+2, \ldots$ as we desire.
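
The link to the geometric distribution can also be seen by simulation: the number of trials until the
$r$'th success is a sum of $r$ independent geometric counts, each with support 1,2,.... The sketch below
illustrates this under that assumption. The helper names geometricRV and negBinomialRV and the
seed are our own, and the sample mean is compared with the theoretical mean $r/p$.

```julia
using Random, Statistics

# Number of trials until the first success (geometric, support 1,2,...).
function geometricRV(p)
    x = 1
    while rand() >= p      # keep going while the current trial is a failure
        x += 1
    end
    return x
end

# Trials until the r'th success: a sum of r independent geometric counts.
negBinomialRV(r, p) = sum(geometricRV(p) for _ in 1:r)

Random.seed!(0)            # arbitrary seed, only for reproducibility
r, p, N = 5, 18/37, 10^5
data = [negBinomialRV(r, p) for _ in 1:N]

println("Sample mean: ", mean(data), "   theoretical mean r/p: ", r/p)
println("Smallest value observed: ", minimum(data))   # can never be below r
```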





Hypergeometric Distribution 


Moving on from Bernoulli trials, we now consider the hypergeometric distribution. To put it 
in context, consider the fishing problem discussed in Section specifically the case where we 
fish without replacement. In this scenario, each time we sample from the population it decreases, 
and hence the probability of success changes for each subsequent sample. The hypergeometric 
distribution describes this situation. The PMF is given by,

$$p(x) = \frac{\binom{K}{x}\binom{L-K}{n-x}}{\binom{L}{n}} \quad \text{for } x = \max(0, n+K-L), \ldots, \min(n, K).$$

Here the parameter $L$ is the population size, and $K$ is the number of successes present in the
population (this implies that $L - K$ is the number of failures present in the population). The
parameter $n$ is the number of samples taken from the population, and the input argument $x$ is the
number of successful samples observed. Hence a hypergeometric random variable $X$ with
$P(X = x) = p(x)$ describes the number of successful samples when sampling without replacement. Note
that the expression for $p(x)$ can be deduced directly via combinatorial counting arguments.


To understand the support of the distribution first consider the least possible value, max(0,n + 
K — L). It is either 0, or n+ K — L if n > L — K. The latter case stems from a situation where the 
number of samples n, is greater than the number of failures present in the population. That is, in 
such a case the least possible number of successes that can be sampled is, 


number of samples (n) — number of failures in the population (L — K). 


As for the upper value of the support, it is min(n, K) because if K < n then it isn’t possible to 
sample only successes. Note that in general if the sample size n is not “too big” then the support 
reduces to $x = 0,\ldots,n$.
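
The support and the counting formula can be checked numerically on a small case. The sketch below
uses illustrative values that are not from the text (a population of 10 with 7 successes and a sample
of 5), chosen so that the lower end of the support is strictly positive. The function name hyperPMF
is our own.

```julia
# Hypergeometric PMF from the counting formula, with L, K, n as in the text.
hyperPMF(L, K, n, x) = binomial(K, x) * binomial(L - K, n - x) / binomial(L, n)

# Illustrative values: population of 10, 7 successes, sample of 5.
L, K, n = 10, 7, 5
support = max(0, n + K - L):min(n, K)    # only 3 failures exist, so at least 2 successes

pmf = [hyperPMF(L, K, n, x) for x in support]
println("Support: ", support)
println("Sum of PMF = ", sum(pmf))       # a valid PMF sums to 1 (up to rounding)
```

For much larger populations, binomial() on standard integers can overflow; one option is to pass
arbitrary precision integers, for example big(L), or to use pdf() on a Hypergeometric object instead.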


To help illustrate this distribution, we look at an example where we compare several hypergeometric
distributions simultaneously. As before, let us consider a pond which contains a combination
of gold and silver fish. In this example, there are $L = 500$ fish in total, and we will define the catch
of a gold fish a success, and a silver fish a failure. Now say that we sample $n = 30$ fish without
replacement. We consider several of these cases, where the only difference between each is the
number of successes, $K$, (gold fish) in the population.


Listing 3.18 below plots the PMFs of 5 different hypergeometric distributions based on the
number of successes in the population. The results are shown in Figure 3.13. It can be observed
that as the number of successes present in the population increases, the PMF shifts further towards
the right. Note that in the Julia Distributions package, Hypergeometric is parameterized
via the number of successes (first argument) and number of failures (second argument), with the
third argument being the sample size. This is slightly different to our parameterization above, which
uses L, K and n.

Figure 3.13: A comparison of several hypergeometric distributions
for different proportions of successes in a population.


Listing 3.18: Comparison of several hypergeometric distributions


using Distributions, Plots; pyplot()

L, K, n = 500, [450, 400, 250, 100, 50], 30
hyperDists = [Hypergeometric(k,L-k,n) for k in K]
xGrid = 0:1:n
pmfs = [ pdf.(dist, xGrid) for dist in hyperDists ]
labels = "Successes = " .* string.(K)

bar( xGrid, pmfs,
    alpha=0.8, c=[:orange :purple :green :red :blue ],
    label=hcat(labels...), ylims=(0,0.25),
    xlabel="x", ylabel="Probability", legend=:top)











In line 3 we define the population size L, the sample size n, and the array K, which contains the
number of successes in the population for each of our 5 scenarios. In line 4 the Hypergeometric()
constructor is used to create several hypergeometric distributions. The constructor takes three
arguments: the number of successes in the population k, the number of failures in the population L-k,
and the number of times we sample from the population without replacement n. This constructor is
then wrapped in a comprehension in order to create an array of different hypergeometric distributions,
hyperDists. We then create an array of arrays, pmfs in line 6, by applying the pdf() function
on each distribution. In lines 9-12, the bar() function is used to plot a bar chart of the PMF for
each hypergeometric distribution in hyperDists. Notice the use of hcat(labels...) to convert
labels from Array{String,1} to Array{String,2}, which is required to label the plots
in bar().
Poisson Distribution and Poisson Process


The Poisson process is a stochastic process (random process) which can be used to model oc- 
currences of events over time (or more generally in space). It may be used to model the arrival of 
customers to a system, the emission of particles from radioactive material, or packets arriving to 
a communication router. The Poisson process is the canonical example of a point process captur- 
ing the most sensible model for completely random occurrences over time. A full description and 
analysis of the Poisson process is beyond our scope, however we provide an overview of the basics. 


In a Poisson process, during an infinitesimally small time interval, $\Delta t$, it is assumed that (as
$\Delta t \to 0$) there is an occurrence with probability $\lambda \Delta t$, and no occurrence with probability $1 - \lambda \Delta t$.
Furthermore, as $\Delta t \to 0$, it is assumed that the chance of 2 or more occurrences during an interval
of length $\Delta t$ tends to 0. Here $\lambda > 0$ is the intensity of the Poisson process, and has the property
that when multiplied by an interval of length $T$, the mean number of occurrences during the interval
is $\lambda T$.


The exponential distribution, discussed in the next section, is closely related to the Poisson
process as the times between occurrences in the Poisson process are exponentially distributed.
Another closely related distribution is the Poisson distribution that we discuss now. For a Poisson
process over the time interval $[0,T]$ the number of occurrences satisfies,

$$P(x \text{ Poisson process occurrences during interval } [0,T]) = e^{-\lambda T}\,\frac{(\lambda T)^x}{x!} \quad \text{for } x = 0,1,\ldots.$$

The PMF $p(x) = e^{-\lambda}\lambda^x/x!$ for $x = 0,1,2,\ldots$ describes the Poisson distribution, the mean of which
is $\lambda$. Hence the number of occurrences in a Poisson process during $[0,T]$ is Poisson distributed
with parameter (and mean) $\lambda T$. Note that in applied statistics, the Poisson distribution is also
sometimes taken as a model for occurrences, without explicitly considering a Poisson process. For
example, assume that based on previous measurements, on average 5.5 people arrive at a hair salon
during rush-hour, then the probability of observing $x$ people during rush-hour can be modeled by
the PMF of the Poisson distribution.
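
The connection between the Poisson process and the Poisson distribution can be illustrated by
simulation. The sketch below assumes, as stated above, that times between occurrences are
exponentially distributed with rate $\lambda$, and counts how many occurrences fall in $[0,T]$. The
function name poissonProcessCount, the seed, and the choice $T = 1$ are our own illustrative choices.
The exponential gaps are generated as $-\log(1-U)/\lambda$ for uniform $U$.

```julia
using Random, Statistics

# Count occurrences in [0,T] when the gaps between occurrences are
# exponentially distributed with rate lambda.
function poissonProcessCount(lambda, T)
    t, count = 0.0, 0
    while true
        t += -log(1 - rand()) / lambda    # exponential gap via the inverse transform
        t > T && return count             # the next occurrence falls outside [0,T]
        count += 1
    end
end

Random.seed!(0)                           # arbitrary seed, only for reproducibility
lambda, T, N = 5.5, 1.0, 10^5
data = [poissonProcessCount(lambda, T) for _ in 1:N]

# For a Poisson distribution both the mean and the variance equal lambda*T.
println("Sample mean: ", mean(data), "   sample variance: ", var(data))
```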


The Poisson process possesses many elegant analytic properties, and these sometimes come as an
aid when considering Poisson distributed random variables. One such (seemingly magical) property
is to consider the random variable $N \ge 0$ such that,

$$\prod_{i=1}^{N} U_i \;\ge\; e^{-\lambda} \;>\; \prod_{i=1}^{N+1} U_i, \qquad (3.15)$$

where $U_1, U_2, \ldots$ is a sequence of i.i.d. uniform(0,1) random variables and $\prod_{i=1}^{0} U_i = 1$. It turns
out that seeking such a random variable $N$ produces an efficient recipe for generating a Poisson
random variable. That is, the $N$ defined by (3.15) is Poisson distributed with mean $\lambda$. Notice that
the recipe dictated by (3.15) is to continue multiplying uniform random variables to a "running
product" until the product goes below the desired level $e^{-\lambda}$.
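
One way to see why this recipe yields a Poisson random variable is to take logarithms of the products.
The following is a sketch of the argument, where the notation $E_i$ is introduced here for illustration:

```latex
\begin{align*}
\prod_{i=1}^{N} U_i \;\ge\; e^{-\lambda} \;>\; \prod_{i=1}^{N+1} U_i
\quad &\Longleftrightarrow \quad
\sum_{i=1}^{N} E_i \;\le\; \lambda \;<\; \sum_{i=1}^{N+1} E_i,
\qquad \text{where } E_i = -\log U_i, \\[4pt]
P(E_i > t) &= P\big(U_i < e^{-t}\big) = e^{-t} \quad \text{for } t \ge 0 .
\end{align*}
```

Hence each $E_i$ is exponentially distributed with rate 1, and $N$ counts how many occurrences of a
Poisson process with intensity 1 fall in the interval $[0,\lambda]$. By the discussion above, this count is
Poisson distributed with mean $1 \cdot \lambda = \lambda$.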


Returning to the hair salon example mentioned above, Listing 3.19 below simulates this scenario,
and compares the numerically estimated result against the PMF. The results are presented in
Figure 3.14.

Figure 3.14: The PMF of a Poisson distribution with mean $\lambda = 5.5$.

Listing 3.19: The Poisson distribution


using StatsBase, Distributions, Plots; pyplot()

function prn(lambda)
    k, p = 0, 1
    while p > MathConstants.e^(-lambda)
        k += 1
        p *= rand()
    end
    return k-1
end

xGrid, lambda, N = 0:16, 5.5, 10^6

pDist = Poisson(lambda)
bPmf = pdf.(pDist,xGrid)
data = counts([prn(lambda) for _ in 1:N],xGrid)/N

plot( xGrid, data,
    line=:stem, marker=:circle,
    c=:blue, ms=10, msw=0, lw=4, label="MC estimate")
plot!( xGrid, bPmf, line=:stem,
    marker=:xcross, c=:red, ms=6, msw=0, lw=2, label="PMF",
    ylims=(0,0.2), xlabel="x", ylabel="Probability of x events")

















In lines 3-10 the function prn(), standing for "Poisson random number", is defined. It implements
(3.15) in a straightforward manner and takes a single argument, the expected arrival rate for our
interval, lambda. Line 16 calls prn() a total of N times, counts occurrences, and normalizes them by
N to obtain Monte Carlo estimates of the Poisson probabilities. Lines 18-23 plot these Monte Carlo
estimates as well as the PMF.
3.6 Families of Continuous Distributions


Like families of discrete distributions, families of continuous distributions are parameterized by
a well-defined set of parameters. Typically the PDF, $f(x;\theta)$, is parameterized by the parameter
$\theta \in \Theta$. Hence, technically a family of continuous distributions is the collection of PDFs $f(\cdot\,;\theta)$ for
all $\theta \in \Theta$.


In this section we present some of the most common families of continuous distributions. We 
consider the following: continuous uniform distribution, exponential distribution, gamma distribu- 
tion, beta distribution, Weibull distribution, Gaussian (normal) distribution, Rayleigh distribution 
and Cauchy distribution. As was done with discrete distributions, the approach taken in the code 
examples involves generating random variables from each distribution using first principles. We also 
occasionally dive into related concepts that naturally arise in the context of a given distribution. 
These include the squared coefficient of variation, special functions (gamma and beta), hazard rates, 
various transformations, and heavy tails. 


In Listing 3.20 we illustrate how to create a distribution object for each of the continuous
distributions we cover. The listing and its output style are similar to Listing 3.13 used for discrete
distributions.


Listing 3.20: Families of continuous distributions


using Distributions
dists = [
    Uniform(10,20),
    Exponential(3.5),
    Gamma(0.5,7),
    Beta(10,0.5),
    Weibull(10,0.5),
    Normal(20,3.5),
    Rayleigh(2.4),
    Cauchy(20,3.5)]

println("Distribution \t\t\t Parameters \t Support")
reshape([dists ; params.(dists) ;
        ((d)->(minimum(d),maximum(d))).(dists) ],
        length(dists),3)





Distribution                          Parameters      Support
8x3 Array{Any,2}:
 Uniform{Float64}(a=10.0, b=20.0)     (10.0, 20.0)    (10.0, 20.0)
 Exponential{Float64}(θ=3.5)          (3.5,)          (0.0, Inf)
 Gamma{Float64}(α=0.5, θ=7.0)         (0.5, 7.0)      (0.0, Inf)
 Beta{Float64}(α=10.0, β=0.5)         (10.0, 0.5)     (0.0, 1.0)
 Weibull{Float64}(α=10.0, θ=0.5)      (10.0, 0.5)     (0.0, Inf)
 Normal{Float64}(μ=20.0, σ=3.5)       (20.0, 3.5)     (-Inf, Inf)
 Rayleigh{Float64}(σ=2.4)             (2.4,)          (0.0, Inf)
 Cauchy{Float64}(μ=20.0, σ=3.5)       (20.0, 3.5)     (-Inf, Inf)

Figure 3.15: The PDF of a continuous uniform distribution over [0, 2π].


Continuous Uniform Distribution 


The continuous uniform distribution describes the case where the outcome of a continuous random
variable $X$ has a constant likelihood of occurring over some finite interval. Since the integral
of the PDF must equal one, given an interval $(a,b)$, the PDF is given by

$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b, \\[6pt] 0 & \text{for } x < a \text{ or } x > b. \end{cases}$$


As an example, consider the case of a fast spinning circular disk, such as a hard drive. Imagine
now there is a small defect on the disk, and we define $X$ as the clockwise angle (in radians) the
defect makes with the read head at an arbitrary time. In this case $X$ is modeled by the continuous
uniform distribution over $x \in [0, 2\pi]$. Listing 3.21 creates Figure 3.15 where we compare the PDF
and a Monte Carlo based estimate.


Listing 3.21: Uniformly distributed angles


using Distributions, Plots, LaTeXStrings; pyplot()

cUnif = Uniform(0,2π)
xGrid, N = 0:0.1:2π, 10^6

stephist( rand(N)*2π, bins=xGrid,
    normed=:true, c=:blue,
    label="MC Estimate")
plot!( xGrid, pdf.(cUnif,xGrid),
    c=:red,ylims=(0,0.2),label="PDF", ylabel="Density",xticks=([0:π/2:2π;],
    ["0", L"\dfrac{\pi}{2}", L"\pi", L"\dfrac{3\pi}{2}", L"2\pi"]))

In line 3 the Uniform() function is used to create a continuous uniform distribution over the domain
[0, 2π]. In Julia you can use the unicode character π or pi. In line 6, rand(N)*2π is used to generate
N uniform random values on [0, 2π]. That is, we simulate N continuous uniform random variables over
the domain [0,1] via the rand() function, and then scale each of these by a factor of 2π. An
alternative would be to use rand(cUnif,N). A histogram of this data is then plotted using stephist().
Notice that the bins argument is set to the range xGrid. An alternative would be to specify an
integer number of bins. Line 9 uses pdf() on the distribution object cUnif to plot the analytic
PDF. Notice the use of L from the LaTeXStrings package in line 11 for creating formulas.





Exponential Distribution 


As alluded to in the discussion of the Poisson process above, the exponential distribution is
often used to model random durations between occurrences. A non-negative random variable $X$,
exponentially distributed with a rate parameter $\lambda > 0$ has PDF,

$$f(x) = \lambda e^{-\lambda x}.$$


As can be verified, the mean is $1/\lambda$, the variance is $1/\lambda^2$, and the CCDF is $\overline{F}(x) = e^{-\lambda x}$. Note
that in Julia, the distribution is parameterized by the mean, rather than by $\lambda$. Hence to create an
exponential distribution object with $\lambda = 0.2$ (for example), one would use Exponential(5.0).





Exponential random variables possess a lack of memory property. It can be verified that, 
$$P(X > t+s \mid X > t) = P(X > s).$$


To show this, expand the conditional probability and use the CCDF. A similar property holds 
for geometric random variables. This hints at the fact that exponential random variables are the 
continuous analogs of geometric random variables. 
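
The expansion mentioned above can be written out in a few steps. For $s, t \ge 0$, the event
$\{X > t+s\}$ is contained in $\{X > t\}$, so using the CCDF $\overline{F}(x) = e^{-\lambda x}$ we have:

```latex
\begin{align*}
P(X > t+s \mid X > t)
  &= \frac{P(X > t+s,\; X > t)}{P(X > t)}
   = \frac{P(X > t+s)}{P(X > t)} \\[4pt]
  &= \frac{e^{-\lambda (t+s)}}{e^{-\lambda t}}
   = e^{-\lambda s}
   = P(X > s).
\end{align*}
```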


To explore this further, consider a transformation of an exponential random variable $X$,
$Y = \lfloor X \rfloor$, where $\lfloor \cdot \rfloor$ represents the mathematical floor function. In this case, $Y$ is no longer a continuous
random variable, but is discrete in nature, taking on values in the set $\{0,1,2,\ldots\}$.


We can show that the PMF of $Y$ is,

$$p_Y(y) = P(\lfloor X \rfloor = y) = \int_{y}^{y+1} \lambda e^{-\lambda x}\,dx = (e^{-\lambda})^{y}\,(1 - e^{-\lambda}) \quad \text{for } y = 0,1,2,\ldots.$$


If we then set $p = 1 - e^{-\lambda}$, we observe that $Y$ is a geometric random variable which starts at 0 and
has success parameter $p$.


In Listing 3.22 we present a comparison between the PMF of the floor of an exponential random
variable, and the PMF of the geometric distribution covered in Section 3.5. Remember that in Julia
the support of Geometric() starts at x = 0. The listing creates Figure 3.16.

Figure 3.16: The PMF of the floor of an exponential random variable is a
geometric distribution.


Listing 3.22: Flooring an exponential random variable


using StatsBase, Distributions, Plots; pyplot()

lambda, N = 1, 10^6
xGrid = 0:6

expDist = Exponential(1/lambda)
floorData = counts(convert.(Int,floor.(rand(expDist,N))), xGrid)/N
geomDist = Geometric(1-MathConstants.e^-lambda)

plot( xGrid, floorData,
    line=:stem, marker=:circle,
    c=:blue, ms=10, msw=0, lw=4,
    label="Floor of Exponential")
plot!( xGrid, pdf.(geomDist,xGrid),
    line=:stem, marker=:xcross,
    c=:red, ms=6, msw=0, lw=2,
    label="Geometric", ylims=(0,1),
    xlabel="x", ylabel="Probability")

















In line 6 the Exponential() function is used to create the exponential distribution object, expDist.
Note that the function takes one argument, the mean, which is the inverse of the rate, hence
1/lambda is used. In line 7 we use the rand() function to sample N times from the exponential
distribution expDist. The floor() function is then used to round each observation down to the
nearest integer, and the convert() function is used to convert the values from Float64 to Int type.
The function counts() is then used to count how many times each integer in xGrid occurs, and the
proportions are stored in the array floorData. In line 8 we use the Geometric() function, covered
previously, to create a geometric distribution object with probability of success
1-MathConstants.e^-lambda. Lines 10-18 plot the results where pdf() is applied to geomDist in line 14.
Gamma Distribution and the Squared Coefficient of Variation


The gamma distribution is commonly used for modeling asymmetric non-negative data. It
generalizes the exponential distribution and the chi-squared distribution (covered in Section 5.2 in
the context of statistical inference). To introduce this distribution, consider the following example,
where the lifetimes of light bulbs are exponentially distributed with mean $\lambda^{-1}$. Now imagine we are
lighting a room continuously with a single light bulb, and that we replace the bulb with a new one
when it burns out. If we start at time 0, what is the distribution of time until $n$ bulbs are replaced?


One way to describe this time is by the random variable $T$, where,

$$T = X_1 + X_2 + \ldots + X_n,$$

and $X_i$ are i.i.d. exponential random variables representing the lifetimes of light bulbs. It turns out
that the distribution of $T$ is a gamma distribution. In this case, since it is a sum of i.i.d. exponential
random variables it is also called an Erlang distribution.


We now introduce the PDF of the gamma distribution. It is a function (in $x$) proportional
to $x^{\alpha-1} e^{-\lambda x}$, where the non-negative parameters $\lambda$ and $\alpha$ are called the rate parameter and shape
parameter respectively. In order to normalize this function we need to divide by,

$$\int_0^{\infty} x^{\alpha-1} e^{-\lambda x}\,dx.$$


It turns out that this integral can be represented by $\Gamma(\alpha)/\lambda^{\alpha}$, where $\Gamma(\cdot)$ is a well known
mathematical special function called the gamma function, see (3.16). We investigate the gamma function,
and the related beta function and beta distribution below. After using the gamma function for
normalization, the PDF of the gamma distribution is,

$$f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x}.$$


In the light bulbs case, we have that $T \sim \text{Gamma}(n, \lambda)$, with shape parameter $\alpha = n$. In general
for a gamma random variable, $Y \sim \text{Gamma}(\alpha, \lambda)$, the shape parameter $\alpha$ does not have to be a
whole number. It can analytically be evaluated that,

$$E[Y] = \frac{\alpha}{\lambda} \quad \text{and} \quad \text{Var}(Y) = \frac{\alpha}{\lambda^2}.$$


We also take the opportunity here to introduce another general notion of variability, often used for
non-negative random variables, namely the squared coefficient of variation,

$$\text{SCV} = \frac{\text{Var}(Y)}{E[Y]^2}.$$

The SCV is a normalized, or unit-less version of the variance. The lower it is, the less variability
in the random variable. It can be seen that for a gamma random variable, the SCV is $1/\alpha$ and for
our light bulb example above, $\text{SCV}(T) = 1/n$. Hence for large $n$, i.e. more light bulbs, there is less
variability.
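
As a rough check of the $1/n$ claim, the following sketch simulates the total lifetime of $n$ bulbs and estimates the SCV from the samples. It uses only the standard library; the helper names scv, expSample and totalTime, the rate value and the choices of n are illustrative assumptions and not part of the listings in this chapter.

```julia
using Random, Statistics

Random.seed!(0)

# Squared coefficient of variation of a sample: variance divided by squared mean.
scv(data) = var(data) / mean(data)^2

lambda = 1/3      # rate of a single bulb (illustrative value)
numReps = 10^5    # number of simulated replacement sequences

# -log(U)/lambda with U ~ uniform(0,1) is exponential with rate lambda.
# 1 - rand() lies in (0,1], which avoids evaluating log(0).
expSample() = -log(1 - rand()) / lambda
totalTime(n) = sum(expSample() for _ in 1:n)

# Estimated SCV of the time until n bulbs are replaced, for a few values of n.
scvEstimates = Dict(n => scv([totalTime(n) for _ in 1:numReps]) for n in (1, 5, 20))

for n in (1, 5, 20)
    println("n = $n:\testimated SCV = ", round(scvEstimates[n], digits=3),
            "\ttheoretical 1/n = ", round(1/n, digits=3))
end
```

The estimates should be close to 1, 0.2 and 0.05 respectively, up to Monte Carlo error. Note that the SCV does not depend on the rate, so any positive value of lambda would do here.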


Listing 3.23 considers the three cases of $n = 1$, $n = 10$ and $n = 50$ light bulbs (the case of $n = 1$
is exponential). For each scenario, gamma random variables are simulated by generating sums of
exponential random variables. In each case, we set the rate parameter for the light bulbs at $\lambda n$, so
that the mean time until all light bulbs run out is $1/\lambda$, independent of $n$. The resulting histograms
are then compared to the theoretical gamma PDFs. Note that the Julia function Gamma() is not
parametrized by $\lambda$, but by $1/\lambda$ in a similar fashion to the Exponential() function. This inverse
of the rate parameter is called the scale parameter.


Figure 3.17: Plot of histograms of Monte Carlo simulated
gamma observations, against their analytic PDFs.


Listing 3.23: Gamma random variable as a sum of exponentials


using Distributions, Plots; pyplot()

lambda, N = 1/3, 10^5
bulbs = [1, 10, 50]
xGrid = 0:0.1:10
C = [:blue :red :green]
dists = [Gamma(n,1/(n*lambda)) for n in bulbs]

function normalizedData(d::Gamma)
    sh = Int64(shape(d))
    data = [sum(-(1/(sh*lambda))*log.(rand(sh))) for _ in 1:N]
end

L = [ "Shape = "*string.(shape.(i))*", Scale = "*
        string.(round.(scale.(i),digits=2)) for i in dists ]

stephist( normalizedData.(dists), bins=50,
    normed=:true, c=C, xlims=(0,maximum(xGrid)), ylims=(0,1),
    xlabel="x", ylabel="Density", label="")
plot!(xGrid, [pdf.(i,xGrid) for i in dists], c=C, label=reshape(L, 1,:))

In lines 3-6 we define the main variables of our problem. In line 4 we create the array bulbs which
stores the number of bulbs for each case. In line 6 we create an array of colors which are used later for
formatting the plots. In line 7 the Gamma() function is used along with a comprehension to create
a Gamma distribution for each of our cases. The three Gamma distributions are stored in the array
dists. Lines 9-12 define the function normalizedData() which operates on a Gamma distribution
as specified via ::Gamma. The function obtains the shape parameter of the input distribution via
shape() and converts this to an integer. Then -log.(rand(sh)) is a raw way of generating a unit
mean collection of sh exponential random variables using the inverse probability transform. These
are then scaled by the scalar, (1/(sh*lambda)). Lines 14-15 generate the string array L used for
the legend. Notice the use of the round() function. The remainder of the code plots the histograms
and the actual PDFs.





Beta Distribution and Mathematical Special Functions 


The beta distribution is a commonly used distribution when seeking a parameterized shape over
a finite support. It is parametrized by two positive parameters, $\alpha$ and $\beta$. It has a density
proportional to $x^{\alpha-1}(1-x)^{\beta-1}$ for $x \in [0, 1]$. By using different positive values of $\alpha$ and $\beta$, a variety
of shapes can be produced. You may want to try and create such plots yourself to experiment.
One common example is $\alpha = 1$, $\beta = 1$, in which case the distribution reduces to the uniform(0,1)
distribution.


As with the gamma distribution above, in the case of beta, we are also left to seek a
normalizing constant $K$, such that when multiplied by $x^{\alpha-1}(1-x)^{\beta-1}$, the resulting function has a
unit integral over $[0,1]$. In our case,

$$K = \frac{1}{\int_0^1 x^{\alpha-1}(1-x)^{\beta-1} \, dx},$$

and hence the PDF is $f(x) = K \, x^{\alpha-1}(1-x)^{\beta-1}$.


We now explore the beta distribution. By focusing on the normalizing constant, we gain further
insight into the mathematical gamma function $\Gamma(\cdot)$, which is a component of the gamma distribution
covered previously. It turns out that,

$$K = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}.$$


Mathematically, this is the reciprocal of the beta function, evaluated at $\alpha$ and $\beta$. Let us focus
solely on the gamma functions, with the purpose of demystifying their use in the gamma and beta
distributions. The gamma function is a type of special function, and is defined as,

$$\Gamma(z) = \int_0^\infty x^{z-1} e^{-x} \, dx. \qquad (3.16)$$

It is a continuous generalization of factorial. We know that for positive integer $n$,

$$n! = n \cdot (n-1)!, \quad \text{with } 0! = 1.$$
This is the recursive definition of factorial. The gamma function exhibits similar properties, and
one can evaluate it via integration by parts,

$$\Gamma(z) = (z-1) \cdot \Gamma(z-1).$$

Note furthermore that $\Gamma(1) = 1$. Hence we see that for positive integer values of $z$,

$$\Gamma(z) = (z-1)!.$$
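
For readers who want the intermediate step, the integration by parts can be sketched as follows, assuming $z > 1$ so that the boundary term vanishes at both limits:

```latex
\Gamma(z) = \int_0^\infty x^{z-1} e^{-x} \, dx
          = \Big[ -x^{z-1} e^{-x} \Big]_0^\infty + (z-1) \int_0^\infty x^{z-2} e^{-x} \, dx
          = 0 + (z-1)\,\Gamma(z-1).
```

Applying this repeatedly for a positive integer $z$, and stopping at $\Gamma(1) = \int_0^\infty e^{-x}\,dx = 1$, gives $\Gamma(z) = (z-1)(z-2)\cdots 1 = (z-1)!$.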


We now illustrate this in Listing 3.24 and in the process take into consideration the mathematical
function beta and the beta PDF. Observe the difference in Julia between lower case gamma(),
the special mathematical function, and Gamma(), the constructor for the distribution. The same
applies to beta() and Beta().


Listing 3.24: The gamma and beta special functions


using SpecialFunctions, Distributions

a, b = 0.2, 0.7
x = 0.75

betaAB1 = beta(a,b)
betaAB2 = (gamma(a)gamma(b))/gamma(a+b)
betaAB3 = (factorial(a-1)factorial(b-1))/factorial(a+b-1)
betaPDFAB1 = pdf(Beta(a,b),x)
betaPDFAB2 = (1/beta(a,b))*x^(a-1) * (1-x)^(b-1)

println("beta($a,$b) = $betaAB1,\t$betaAB2,\t$betaAB3")
println("betaPDF($a,$b) at $x = $betaPDFAB1,\t$betaPDFAB2")


beta(0.2,0.7) = 5.576463695849875, 5.576463695849875, 5.576463695849877
betaPDF(0.2,0.7) at 0.75 = 0.34214492891381176, 0.34214492891381176





We use the SpecialFunctions package for gamma() and beta(). This package also introduces a
method for factorial() that allows us to evaluate $\Gamma(z)$ via factorial(z-1) even for non-integer z.
In lines 6-8, the beta() special function at a and b is evaluated in three different ways.





Another important property of the gamma function that we encounter later on (in the context of
the chi-squared distribution, which we touch on in Section 5.2) is that $\Gamma(1/2) = \sqrt{\pi}$. In Listing 3.25
we show this via numerical integration.

Listing 3.25: The gamma function at 1/2


using QuadGK, SpecialFunctions

g(x) = x^(0.5-1) * MathConstants.e^-x
quadgk(g, 0, Inf)[1], sqrt(pi), gamma(1/2), factorial(1/2-1)


(1.7724538355037913, 1.7724538509055159, 1.772453850905516, 1.772453850905516)





This example uses the QuadGK package, in the same manner as introduced in Listing 3.3. We can see
that the numerical integration is in agreement with the analytically expected result.

Weibull Distribution and Hazard Rates


We now explore the Weibull distribution along with the concept of the hazard rate function, which
is often used in reliability analysis and survival analysis. For a random variable $T$, representing the
lifetime of an individual or a component, an interesting quantity is the instantaneous chance of
failure at any time, given that the component has been operating without failure up to time $x$. This
can be expressed as,

$$h(x) = \lim_{\Delta \to 0} \frac{1}{\Delta} P\big(T \in [x, x+\Delta) \mid T > x\big).$$


Alternatively, by using the conditional probability and noticing that the PDF $f(x)$ satisfies
$f(x)\Delta \approx P(x \le T < x + \Delta)$ for small $\Delta$, we can express the above as,

$$h(x) = \frac{f(x)}{1 - F(x)}. \qquad (3.17)$$

Here the function $h(\cdot)$ is called the hazard rate, and it is a common method of viewing the
distribution for lifetime random variables $T$. In fact, we can reconstruct the CDF $F(x)$ by,

$$1 - F(x) = \exp\Big(-\int_0^x h(t) \, dt\Big). \qquad (3.18)$$


Hence every continuous non-negative random variable can be described uniquely by its hazard rate.
The Weibull distribution is naturally defined through the hazard rate by considering hazard rate
functions that have a specific simple form. It is a distribution with,

$$h(x) = \lambda x^{\alpha - 1}, \qquad (3.19)$$


where $\lambda$ and $\alpha$ are positive (for $\alpha \le 0$ the integral in (3.18) diverges for every $x > 0$, so no
proper distribution results). Notice that the parameter $\alpha$ gives the Weibull
distribution different modes of behavior. If $\alpha = 1$ then the hazard rate is constant, in which case the
Weibull distribution is actually an exponential distribution with rate $\lambda$. If $\alpha > 1$, then the hazard
rate increases over time. This depicts a situation of "aging components", i.e. the longer a component
has lived, the higher the instantaneous chance of failure. This is sometimes called Increasing Failure
Rate (IFR). Conversely, $\alpha < 1$ depicts a situation where the longer a component has lasted, the
lower the chance of it failing (as is perhaps the case with totalitarian political regimes). This is
sometimes called Decreasing Failure Rate (DFR).


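Based on (3.19) and using (3.18) we obtain the CDF and PDF,

$$F(x) = 1 - e^{-\frac{\lambda}{\alpha} x^{\alpha}}, \quad \text{and} \quad f(x) = \lambda x^{\alpha-1} e^{-\frac{\lambda}{\alpha} x^{\alpha}}. \qquad (3.20)$$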


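Note that in Julia, the distribution is parameterized slightly differently via,

$$f(x) = \frac{\alpha}{\theta}\Big(\frac{x}{\theta}\Big)^{\alpha-1} e^{-(x/\theta)^{\alpha}} = \alpha\,\theta^{-\alpha} x^{\alpha-1} e^{-\theta^{-\alpha} x^{\alpha}},$$

where the bijection from $\lambda$ to $\theta$ is,

$$\lambda = \alpha\,\theta^{-\alpha} \quad \text{and} \quad \theta = \Big(\frac{\alpha}{\lambda}\Big)^{1/\alpha}. \qquad (3.21)$$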


In this case, $\theta$ is called the scale parameter and $\alpha$ is the shape parameter.
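
Since the CDF in (3.20) can be inverted in closed form, Weibull random variables can also be generated via the inverse probability transform. The following is a minimal sketch under that assumption, using only the standard library. The parameter values and the names Finv, checkPoints and maxGap are illustrative and do not appear in the listings of this chapter.

```julia
using Random, Statistics

Random.seed!(0)

lam, alpha = 2.0, 1.5        # illustrative (assumed) parameter values

# CDF from (3.20) and its inverse, obtained by solving F(x) = u for x.
F(x)    = 1 - exp(-(lam/alpha) * x^alpha)
Finv(u) = (-(alpha/lam) * log(1 - u))^(1/alpha)

# Inverse probability transform: apply Finv to uniform(0,1) samples.
data = [Finv(rand()) for _ in 1:10^5]

# Compare the empirical CDF with F at a few points.
checkPoints = [0.25, 0.5, 1.0, 1.5]
empirical = [mean(data .<= x) for x in checkPoints]
maxGap = maximum(abs.(empirical .- F.(checkPoints)))
println("Maximum gap between empirical and analytic CDF: ", maxGap)

# Scale parameter of the Julia parameterization, via (3.21).
theta = (alpha/lam)^(1/alpha)
println("Equivalent scale parameter theta = ", theta)
```

The gap should be small (of the order of the Monte Carlo error), and converting theta back via the first equation of (3.21) should return lam.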


In Listing 3.26 we look at several hazard rate functions for different Weibull distributions using
the parameterization (3.20), and show their differences in Figure 3.18. The example also shows how
to use the shape() and scale() functions from the Distributions package.


Figure 3.18: Hazard rate functions for different Weibull distributions.


Listing 3.26: Hazard rates and the Weibull distribution


using Distributions, Plots, LaTeXStrings; pyplot()


alphas LO. 
lam = 


lambda(dist::Weibull) = shape(dist)*scale(dist)^(-shape(dist))
theta(lam,alpha) = (alpha/lam)^(1/alpha)

dists = [Weibull.(a,theta(lam,a)) for a in alphas]

hA(dist,x) = pdf(dist,x)/ccdf(dist,x)
hB(dist,x) = lambda(dist)*x^(shape(dist)-1)

xGrid = 0.01:0.01:10
hazardsA = [hA.(d,xGrid) for d in dists]
hazardsB = [hB.(d,xGrid) for d in dists]

println("Maximum difference between two implementations of hazard: ",
    maximum(maximum.(hazardsA-hazardsB)))

Cl = [:blue :red :green]
Lb = [L"\lambda=" * string(lambda(d)) * ",   " * L"\alpha =" * string(shape(d))
        for d in dists]

plot(xGrid, hazardsA, c=Cl, label=reshape(Lb, 1,:), xlabel="x",
    ylabel="Instantaneous failure rate", xlims=(0,10), ylims=(0,10))


Maximum difference between two implementations of hazard: 1.7763568394002505e-15 

In line 6, we define the function lambda(), which operates on a Weibull type distribution and
implements the first equation in (3.21). Note the type specification ::Weibull and the use of the
shape() and scale() functions. In line 7 we define the function theta() which implements the
second equation in (3.21). Line 9 constructs three Weibull objects in the array dists. Lines 11 and
12 implement two alternative ways of calculating the hazard rate function, hA() and hB(). The first
uses (3.17) and the second uses (3.19). Then in lines 18-19, we verify that the two implementations
are in agreement. The remainder of the code creates Figure 3.18.





Gaussian (Normal) Distribution 


Arguably, the most well known distribution is the normal distribution, also known as the Gaussian
distribution. It is a symmetric "bell-shaped" distribution, which can be found throughout
nature. Examples include the distribution of heights among adult humans and noise disturbances
of electrical signals. It is commonly exhibited due to the central limit theorem, which is covered in 


more depth in Section 


The Gaussian distribution is defined by two parameters, $\mu$ and $\sigma^2$, which are the mean and
variance respectively. The mean $\mu$ can take on any value and $\sigma^2$ is restricted to be positive. The
phrase standard normal signifies the case of a normal distribution with $\mu = 0$ and $\sigma^2 = 1$. In the
general case, the PDF is given by,

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$


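The CDF of the normal distribution is not available as a simple expression. However, it is
frequently needed and hence statistical tables or software are often used. The CDF of the standard
normal random variable is,

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \, e^{-\frac{t^2}{2}} \, dt = \frac{1}{2}\Big(1 + \text{erf}\Big(\frac{x}{\sqrt{2}}\Big)\Big). \qquad (3.22)$$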


The second expression represents $\Phi(\cdot)$ in terms of the error function $\text{erf}(\cdot)$. It is a mathematical
special function defined as,

$$\text{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt.$$


With $\Phi(\cdot)$ (or alternately $\text{erf}(\cdot)$) tabulated, one can move on to a general normal random variable
with mean $\mu$ and variance $\sigma^2$. In this case, the CDF is available via,

$$\Phi\Big(\frac{x - \mu}{\sigma}\Big).$$
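
The standardization step can be checked numerically. The sketch below is an illustration only: it approximates $\Phi(\cdot)$ by a crude midpoint rule rather than by erf(), and compares the standardized value against a Monte Carlo estimate. The names phi and Phi, the integration settings and the values of mu, sigma and x are all assumptions made for this example.

```julia
using Random, Statistics

Random.seed!(0)

# Standard normal PDF and a crude numerical CDF (midpoint rule from -8 up to x).
phi(t) = exp(-t^2/2) / sqrt(2*pi)
function Phi(x; lower=-8.0, steps=10^4)
    h = (x - lower) / steps
    return h * sum(phi(lower + (k - 0.5)*h) for k in 1:steps)
end

mu, sigma = 2.0, 3.0            # illustrative (assumed) mean and standard deviation
x = 4.0

# CDF of the general normal random variable at x, via standardization.
cdfViaStandardization = Phi((x - mu)/sigma)

# Monte Carlo check: mu + sigma*Z is normal with mean mu and variance sigma^2.
samples = mu .+ sigma .* randn(10^6)
cdfViaSimulation = mean(samples .<= x)

println("Standardization: ", cdfViaStandardization)
println("Simulation:      ", cdfViaSimulation)
```

The two printed values should agree to within Monte Carlo error.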


As an illustrative example, Listing 3.27 plots the standard normal PDF, along with its first and
second derivatives in Figure 3.19. The first derivative is clearly 0 at the PDF's unique maximum at
$x = 0$. The second derivative is 0 at the points $x = -1$ and $x = +1$. These are exactly the inflection
points of the normal PDF (points where the function switches between being locally convex and
locally concave). This code example also illustrates the use of numerical derivatives from
the Calculus package. The code also presents two alternative ways of implementing $\Phi(\cdot)$ of (3.22)
and shows they are equivalent. One way uses cdf() from the Distributions package and the
other way uses erf() from the SpecialFunctions package.


Figure 3.19: Plot of the standard normal PDF
and its first and second derivatives.


Listing 3.27: Numerical derivatives of the normal density


using Distributions, Calculus, SpecialFunctions, LaTeXStrings, Plots; pyplot()

xGrid = -5:0.01:5

PhiA(x) = 0.5*(1+erf(x/sqrt(2)))
PhiB(x) = cdf(Normal(),x)

println("Maximum difference between two CDF implementations: ",
    maximum(PhiA.(xGrid) - PhiB.(xGrid)))

normalDensity(z) = pdf(Normal(),z)

d0 = normalDensity.(xGrid)
d1 = derivative.(normalDensity,xGrid)
d2 = second_derivative.(normalDensity, xGrid)

plot(xGrid, [d0 d1 d2], c=[:blue :red :green], label=[L"f(x)" L"f'(x)" L"f''(x)"])
plot!([-5,5], [0,0], color=:black, lw=0.5, xlabel="x", xlims=(-5,5), label="")


Maximum difference between two CDF implementations: 1.1102230246251565e-16





Lines 5-9 are dedicated to showing the equivalence of the two ways of implementing $\Phi(\cdot)$. In line 11 we
define the function normalDensity(), which takes an input z, and returns the corresponding value
of the PDF of a standard normal distribution. Then in lines 14-15, the functions derivative() and
second_derivative() are used to evaluate the first and second derivatives of normalDensity
respectively. The curves are plotted in lines 17-18.

Rayleigh Distribution and the Box-Muller Transform


We now consider an exponentially distributed random variable, $X$, with rate parameter
$\lambda = \sigma^{-2}/2$ where $\sigma > 0$. If we set a new random variable, $R = \sqrt{X}$, what is the distribution of $R$? To
work this out analytically, we have for $y \ge 0$,

$$F_R(y) = P(\sqrt{X} \le y) = P(X \le y^2) = F_X(y^2) = 1 - \exp\Big(-\frac{y^2}{2\sigma^2}\Big),$$

and by differentiating, we get the density,

$$f_R(y) = \frac{y}{\sigma^2} \exp\Big(-\frac{y^2}{2\sigma^2}\Big).$$

This is the density of the Rayleigh distribution with parameter $\sigma$. We see it is related to the
exponential distribution via a square root transformation. Hence the implication is that since we
know how to generate exponential random variables via $-\frac{1}{\lambda}\log(U)$ where $U \sim$ uniform(0, 1), then by
applying a square root we can generate Rayleigh random variables.


The Rayleigh distribution is important because of another distributional relationship. Consider
two independent normally distributed random variables, $N_1$ and $N_2$, each with mean 0 and standard
deviation $\sigma$. In this case, it turns out that $\tilde{R} = \sqrt{N_1^2 + N_2^2}$ is Rayleigh distributed just as $R$ above.
As we see in the next example this property yields a method for generating normal random variables.
It also yields a statistical model often used in radio communications called Rayleigh fading.


Listing 3.28 demonstrates three alternative ways of generating Rayleigh random variables. It
generates $R$ and $\tilde{R}$ as above, as well as by applying rand() to a Rayleigh object from the
Distributions package. The mean of a Rayleigh random variable is $\sigma\sqrt{\pi/2}$ and is approximately
2.1306 when $\sigma = 1.7$, as in the code below.


Listing 3.28: Alternative representations of Rayleigh random variables


using Distributions, Random
Random.seed!(1)

N = 10^6    # sample size (value assumed here; any large N illustrates the point)
sig = 1.7

data1 = sqrt.(-(2*sig^2)*log.(rand(N)))

distG = Normal(0,sig)
data2 = sqrt.(rand(distG,N).^2 + rand(distG,N).^2)

distR = Rayleigh(sig)
data3 = rand(distR,N)

mean.([data1, data2, data3])


3-element Array{Float64,1}:
 2.1309969895700465
 2.1304634508886053
 2.1292020616665392

Figure 3.20: Geometry of the Box-Muller transform.


Line 7 generates data1, according to $R$ above. Note the use of element wise mapping of sqrt()
and log(). Lines 9 and 10 generate data2, as in $\tilde{R}$ above. Here we use rand() applied to the
normal distribution object distG. Lines 12 and 13 use rand() applied to a Rayleigh distribution
object. Line 15 produces the output by applying mean() to data1, data2 and data3 individually.
Observe that the sample means are very close to the theoretical mean presented above.





A common way to generate normal random variables, called the Box-Muller Transform, is to
use the relationship between the Rayleigh distribution and a pair of independent zero mean normal
random variables, as mentioned above. Consider Figure 3.20 representing the relationship between
the pair $(N_1, N_2)$ and their polar coordinate counterparts, $R$ and $\theta$. Assume now that the Cartesian
coordinates of the point $(N_1, N_2)$ are identically normally distributed, with $N_1$ independent of $N_2$
and set $\sigma = 1$. In this case, by representing $N_1$ and $N_2$ in polar coordinates $(\theta, R)$ we have that
the angle $\theta$ is uniformly distributed on $[0, 2\pi]$ and that the radius $R$ is distributed as a Rayleigh
random variable.


Given this, a recipe for generating $N_1$ and $N_2$ is to first generate $\theta$ and $R$ and then transform
them via,

$$N_1 = R\cos(\theta), \qquad N_2 = R\sin(\theta).$$

Often, $N_2$ is not needed. Hence in practice, given two independent uniform(0,1) random variables
$U_1$ and $U_2$ we set, $Z = \sqrt{-2\ln U_1}\,\cos(2\pi U_2)$. Here $Z$ has a standard normal distribution. Listing
3.29 uses this method to generate normal random variables and compares their histogram to the
standard normal PDF. The output is presented in Figure 3.21.
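
When both coordinates are kept, each pair of uniforms yields two normal random variables. The following sketch, using only the standard library, generates such pairs and checks empirically that the two coordinates have roughly zero mean, unit variance and no correlation. The function name boxMullerPair and the sample size are illustrative assumptions.

```julia
using Random, Statistics

Random.seed!(0)

# One Box-Muller draw: returns the pair (N1, N2) from two independent uniforms.
function boxMullerPair()
    R = sqrt(-2*log(1 - rand()))    # Rayleigh radius with sigma = 1; 1 - rand() avoids log(0)
    theta = 2*pi*rand()             # angle, uniform on [0, 2pi)
    return R*cos(theta), R*sin(theta)
end

draws = [boxMullerPair() for _ in 1:10^6]
n1, n2 = first.(draws), last.(draws)

println("Means:       ", mean(n1), ", ", mean(n2))
println("Variances:   ", var(n1), ", ", var(n2))
println("Correlation: ", cor(n1, n2))
```

A near-zero sample correlation is consistent with, but of course does not prove, the independence of $N_1$ and $N_2$.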


Listing 3.29: The Box-Muller transform


using Random, Distributions, LaTeXStrings, Plots; pyplot()
Random.seed!(1)

Z() = sqrt(-2*log(rand()))*cos(2*pi*rand())
xGrid = -4:0.01:4

histogram([Z() for _ in 1:10^6], bins=50,
    normed=true, label="MC estimate")
plot!(xGrid, pdf.(Normal(),xGrid),
    c=:red, lw=4, label="PDF",
    xlims=(-4,4), ylims=(0,0.5), xlabel="x", ylabel=L"f(x)")

Figure 3.21: The Box-Muller transform can be used to generate
a normally distributed random variable.


In line 4 we define a function Z(), which implements the Box-Muller transform and generates a single
standard normal random variable. The remaining lines plot a histogram of $10^6$ random variables from
Z(), and compare it to the standard normal PDF. Notice the use of the L macro for LaTeX formatting in
line 11.





Cauchy Distribution 


We now introduce the Cauchy distribution, also known as the Lorentz distribution. At first
glance a plot of the PDF looks very similar to the normal distribution. However, it is fundamentally
different as its mean and standard deviation are undefined. The PDF of the Cauchy distribution is
given by,

$$f(x) = \frac{1}{\pi\gamma\Big(1 + \big(\frac{x - x_0}{\gamma}\big)^2\Big)}, \qquad (3.23)$$

where $x_0$ is the location parameter at which the peak is observed and $\gamma$ is the scale parameter.


In order to better understand the context of this type of distribution we will develop a physical
example of a Cauchy distributed random variable. Consider a drone hovering stationary in the sky
at unit height. A pivoting laser is attached to its undercarriage, which pivots back and forth as
it shoots pulses at the ground. At any point the laser fires, it makes an angle $\theta$ from the vertical
($-\pi/2 \le \theta \le \pi/2$) as is illustrated in Figure 3.22.


Figure 3.22: Indicative distribution of laser
shots from a hovering drone.


Since the laser fires at a high frequency as it is pivoting, we can assume that the angle $\theta$ is
distributed uniformly on $[-\pi/2, \pi/2]$. For each shot from the laser, a point can be measured, $X$,
horizontally on the ground from the point above which the drone is hovering. We can now consider
this horizontal measurement as a new random variable, $X = \tan(\theta)$. Hence the CDF is,

$$F_X(x) = P(\tan(\theta) \le x) = P(\theta \le \text{atan}(x)) = F_\theta(\text{atan}(x)) =
\begin{cases}
0, & \text{atan}(x) \le -\pi/2, \\
\frac{1}{2} + \frac{1}{\pi}\,\text{atan}(x), & \text{atan}(x) \in (-\pi/2, \pi/2), \\
1, & \pi/2 \le \text{atan}(x).
\end{cases}$$


Now since it always holds that $\text{atan}(x) \in (-\pi/2, \pi/2)$ we can obtain the density by taking the
derivative of $\frac{1}{\pi}\,\text{atan}(x)$ which evaluates to,

$$f(x) = \frac{1}{\pi(1 + x^2)}.$$


This is a special case of the more general density (3.23) with $x_0 = 0$ and $\gamma = 1$. Importantly,
the expectation integral,

$$\int_{-\infty}^{\infty} x f(x) \, dx,$$

is not defined since each of the one sided improper integrals does not converge. Hence a Cauchy
random variable is an example of a random variable without a mean.
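
To see the divergence explicitly for the case $x_0 = 0$ and $\gamma = 1$, the one sided integral up to a finite limit $M$ can be evaluated in closed form:

```latex
\int_0^M \frac{x}{\pi(1 + x^2)} \, dx = \frac{1}{2\pi} \ln(1 + M^2) \longrightarrow \infty
\quad \text{as } M \to \infty,
\qquad
\int_{-M}^0 \frac{x}{\pi(1 + x^2)} \, dx = -\frac{1}{2\pi} \ln(1 + M^2) \longrightarrow -\infty .
```

The two halves cancel only if the limits are taken symmetrically, so the value obtained depends on how the limits are taken, which is why the expectation is left undefined.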


You may now revisit the law of large numbers (Section and ask what happens to sample 
averages of such random variables. That is, would the sequence of sample averages converge to 
anything? The answer is no. We illustrate this in Listing 3.30 and the associated Figure 3.23. In
this example, occasional large values create huge spikes due to angles near $-\pi/2$ or $\pi/2$. There is
no strong law of large numbers in this case since the mean is not defined.


Listing 3.30: The law of large numbers breaks down with very heavy tails


using Random, Plots; pyplot()
Random.seed!(808)

n = 10^6
data = tan.(rand(n)*pi .- pi/2)
averages = accumulate(+,data) ./ collect(1:n)

plot( 1:n, averages,
    c=:blue, legend=:none,
    xscale=:log10, xlims=(1,n), xlabel="n", ylabel="Running average")

Figure 3.23: Cumulative average of Cauchy distributed random variables.





In line 2 the seed of the random number generator is set, so that the same stream of random numbers
is generated each time. In line 5 we create data, an array of n Cauchy random variables
constructed through the angle mechanism described and illustrated in Figure 3.22. In line 6 we use the
accumulate() function to create a running sum, and then divide this element wise via ./ by the
array collect(1:n). Notice that '+' is used as the first argument to accumulate(). Here the
addition operator is treated as a function. The remainder of the code plots the running average.





3.7 Joint Distributions and Covariance 


We now consider pairs and vectors of random variables. In general, in a probability space,
we may define multiple random variables, $X_1, \ldots, X_n$ where we consider the vector or tuple,
$X = (X_1, \ldots, X_n)$ as a random vector. A key question deals with representing and evaluating probabilities
of the form $P(X \in B)$ where $B$ is some subset of $\mathbb{R}^n$. Our focus here is on the case of a pair of random
variables $(X, Y)$, which are continuous and have a density function. The probability distribution
of $(X, Y)$ is called a bivariate distribution and more generally, the probability distribution of $X$ is
called a multi-variate distribution.


The Joint PDF 


A function, $f_X : \mathbb{R}^n \to \mathbb{R}$ is said to be a joint probability density function (PDF) if for any input,
$x_1, \ldots, x_n$, it holds that $f_X(x_1, \ldots, x_n) \ge 0$ and,

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_X(x_1, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n = 1. \qquad (3.24)$$

Hence if we consider now $B \subset \mathbb{R}^n$, then the probabilities of a random vector $X$, distributed with
density $f_X$, can be evaluated via,

$$P(X \in B) = \int_B f_X(x) \, dx.$$

As an example let $X = (X, Y)$ and consider the joint density,

$$f(x, y) =
\begin{cases}
\frac{9}{8}(4x + y)\sqrt{(1-x)(1-y)}, & x \in [0,1], \ y \in [0,1], \\
0, & \text{otherwise}.
\end{cases}$$


Figure 3.24: A contour plot and a three dimensional surface plot of f(x, y).


This PDF is plotted in Figure 3.24. We may now obtain all kinds of probabilities. For example, set
$B = \{(x, y) \mid x \le y\}$, then,

$$P\big((X, Y) \in B\big) = \int_0^1 \int_x^1 f(x, y) \, dy \, dx = \frac{31}{80} = 0.3875. \qquad (3.25)$$


The joint distribution of $X$ and $Y$ allows us to also obtain related distributions. We may obtain
the marginal densities of $X$ and $Y$, denoted $f_X(\cdot)$ and $f_Y(\cdot)$, via,

$$f_X(x) = \int_0^1 f(x, y) \, dy \quad \text{and} \quad f_Y(y) = \int_0^1 f(x, y) \, dx.$$

For our example by explicitly integrating we obtain,

$$f_X(x) = \frac{3}{10}\sqrt{1-x}\,(1 + 10x) \quad \text{and} \quad f_Y(y) = \frac{3}{20}\sqrt{1-y}\,(8 + 5y).$$


In general, the random variables $X$ and $Y$ are said to be independent if, $f(x, y) = f_X(x) f_Y(y)$.
In our current example, this is not the case. Furthermore, whenever we have two densities of
scalar random variables, we may multiply them to make the joint distribution of the random vector
composed of independent random variables. That is, if we take our $f_X(\cdot)$ and $f_Y(\cdot)$ above, we may
create $\tilde{f}(x, y)$ via,

$$\tilde{f}(x, y) = f_X(x) f_Y(y) = \frac{9}{200}\sqrt{(1-x)(1-y)}\,(1 + 10x)(8 + 5y).$$

Observe that $\tilde{f}(x, y) \neq f(x, y)$. Hence we see that while both bivariate distributions have the
same marginal distributions, they are different bivariate distributions and hence describe different
relationships between $X$ and $Y$.


Of further interest is the conditional density of $X$ given $Y$, and vice-versa. It is denoted by
$f_{X \mid Y=y}(x)$ and describes the distribution of the random variable $X$, given the specific value $Y = y$.
It can be obtained from the joint density via,

$$f_{X \mid Y=y}(x) = \frac{f(x, y)}{f_Y(y)} = \frac{f(x, y)}{\int_0^1 f(x, y) \, dx}.$$
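
A quick numerical sanity check is that, for any fixed $y$, the conditional density integrates to one over $x$. The sketch below does this with a midpoint Riemann sum for the running example. It takes the joint density $f(x,y) = \frac{9}{8}(4x+y)\sqrt{(1-x)(1-y)}$ on the unit square and the marginal $f_Y(y) = \frac{3}{20}\sqrt{1-y}\,(8+5y)$ obtained by explicit integration; the names fY, fXgivenY, midpoints and marginalGap are illustrative assumptions.

```julia
# Joint density of the running example on the unit square.
f(x, y) = 9/8 * (4x + y) * sqrt((1 - x)*(1 - y))

# Marginal density of Y, as obtained by explicit integration.
fY(y) = 3/20 * sqrt(1 - y) * (8 + 5y)

# Conditional density of X given Y = y.
fXgivenY(x, y) = f(x, y) / fY(y)

delta = 0.001
midpoints = delta/2:delta:1     # midpoint rule grid on [0,1]

for y in (0.1, 0.5, 0.9)
    total = sum(fXgivenY(x, y) for x in midpoints) * delta
    println("y = $y:\tintegral of conditional density over x = ", round(total, digits=4))
end

# Numerical check of the marginal formula at one point.
yTest = 0.5
marginalGap = abs(sum(f(x, yTest) for x in midpoints)*delta - fY(yTest))
println("Gap between numerical and analytic marginal at y = $yTest: ", marginalGap)
```

Each printed integral should be close to 1, and the gap for the marginal should be close to 0, up to the discretization error of the Riemann sum.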





In Listing 3.31 we generate Figure 3.24 and in addition use crude Riemann sums to approximate
the integral (3.25) as well as the integral over the total density.


Listing 3.31: Visualizing a bivariate density


using Plots, LaTeXStrings, Measures; pyplot()

delta = 0.01
grid = 0:delta:1
f(x,y) = 9/8*(4x+y)*sqrt((1-x)*(1-y))
z = [f(x,y) for y in grid, x in grid]

densityIntegral = sum(z)*delta^2
println("2-dimensional Riemann sum over density: ", densityIntegral)

probB = sum([sum([f(x,y)*delta for y in x:delta:1])*delta for x in grid])
println("2-dimensional Riemann sum to evaluate probability: ", probB)

p1 = surface(grid, grid, z,
        c=cgrad([:blue, :red]), la=1, camera=(60,50),
        ylabel="y", zlabel=L"f(x,y)", legend=:none)
p2 = contourf(grid, grid, z,
        c=cgrad([:blue, :red]))
p2 = contour!(grid, grid, z,
        c=:black, xlims=(0,1), ylims=(0,1), ylabel="y", ratio=:equal)

plot(p1, p2, size=(800, 400), xlabel="x", margin=5mm)


2-dimensional Riemann sum over density: 1.0063787264382458
2-dimensional Riemann sum to evaluate probability: 0.3932640388868346





In line 5 we define the bivariate density function, f(). In line 6 we evaluate the density over a grid of
x and y values. This grid is then used to obtain a crude approximation of the integral in line 8 with
the result printed in line 9. Similarly, the nested integral (3.25) is approximated via two Riemann
sums in line 11 with the result printed in line 12. The remainder of the code creates Figure 3.24.

Covariance and Vectorized Moments


Given two random variables, $X$ and $Y$, with respective means, $\mu_X$ and $\mu_Y$, the covariance is
defined by,

$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y.$$


The second formula follows by expansion. Notice also that $\text{Cov}(X, X) = \text{Var}(X)$ by comparing
with (3.3). The covariance is a common measure of the relationship between the two random
variables, where if $\text{Cov}(X, Y) = 0$, we say the random variables are uncorrelated. Furthermore, if
$\text{Cov}(X, Y) \neq 0$, its sign gives an indication of the direction of the relationship.


Another important concept is the correlation coefficient,

$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}. \qquad (3.26)$$

It is a normalized form of the covariance with $-1 \le \rho_{XY} \le 1$. Values nearing $\pm 1$ indicate a very
strong linear relationship between $X$ and $Y$, whereas values near or at 0 indicate a lack of a linear
relationship.


Note that if $X$ and $Y$ are independent random variables, then $\text{Cov}(X, Y) = 0$ and hence
$\rho_{XY} = 0$. However, the opposite case does not always hold, since in general $\rho_{XY} = 0$ does not imply
independence. Nevertheless as described below, for jointly normal random variables it does.
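
A standard illustration of zero correlation without independence is a standard normal $X$ together with $Y = X^2$: here $\text{Cov}(X, Y) = E[X^3] = 0$, even though $Y$ is completely determined by $X$. The sketch below checks this by simulation using only the standard library; the variable names and the threshold $|X| > 1$ are illustrative assumptions.

```julia
using Random, Statistics

Random.seed!(0)

x = randn(10^6)      # standard normal samples
y = x.^2             # Y is a deterministic function of X, hence dependent on X

sampleCor = cor(x, y)
println("Sample correlation of X and Y = X^2: ", sampleCor)

# Dependence shows up in conditional behavior: Y is larger when |X| > 1.
meanYLargeX = mean(y[abs.(x) .> 1])
meanYSmallX = mean(y[abs.(x) .<= 1])
println("Mean of Y given |X| > 1:  ", meanYLargeX)
println("Mean of Y given |X| <= 1: ", meanYSmallX)
```

The sample correlation should be near 0, while the two conditional means differ markedly, which could not happen if $X$ and $Y$ were independent.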


Consider now a random vector $X = (X_1, \ldots, X_n)$, taken as a column vector. It can be described
by moments in an analogous manner to a scalar random variable as was detailed in Section A 
key quantity is the mean vector,

$$\mu_X = \big[E[X_1], E[X_2], \ldots, E[X_n]\big]^T.$$


Furthermore, the covariance matrix is the matrix defined by the expectation (taken element wise)
of the (outer product) random matrix given by $(X - \mu_X)(X - \mu_X)^T$, and is expressed as

$$\Sigma_X = \text{Cov}(X) = E\big[(X - \mu_X)(X - \mu_X)^T\big]. \qquad (3.27)$$


As can be verified, the $i,j$'th element of $\Sigma_X$ is $\text{Cov}(X_i, X_j)$ and hence the diagonal elements are
the variances.


Linear Combinations and Transformations 


We now consider linear transformations applied to random vectors. For any collection of random
variables,

$$E[X_1 + \ldots + X_n] = E[X_1] + \ldots + E[X_n].$$

For uncorrelated random variables,

$$\text{Var}(X_1 + \ldots + X_n) = \text{Var}(X_1) + \ldots + \text{Var}(X_n).$$

More generally if we allow the random variables to be correlated, then,

$$\text{Var}(X_1 + \ldots + X_n) = \text{Var}(X_1) + \ldots + \text{Var}(X_n) + 2\sum_{i<j} \text{Cov}(X_i, X_j). \qquad (3.28)$$

Figure 3.25: Random vectors from three different distributions,
each sharing the same mean and covariance matrix.

Note that the right hand side of (3.28) is the sum of the elements of the matrix $\text{Cov}((X_1, \ldots, X_n))$.
This is a special case of a more general affine transformation, where we take a random vector
$X = (X_1, \ldots, X_n)$ with covariance matrix $\Sigma_X$, and an $m \times n$ matrix $A$ and $m$ vector $b$. We then
set,

$$Y = AX + b. \qquad (3.29)$$


In this case, the new random vector $Y$ exhibits mean and covariance,

$$E[Y] = A\,E[X] + b \quad \text{and} \quad \text{Cov}(Y) = A \Sigma_X A^T. \qquad (3.30)$$

Now to retrieve (3.28) we use the $1 \times n$ matrix $A = [1, \ldots, 1]$ and observe that $A \Sigma_X A^T$ is a sum of
all of the elements of $\Sigma_X$.
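
This observation is easy to verify numerically. The sketch below uses a hypothetical $3 \times 3$ covariance matrix, chosen only for illustration, and compares $A \Sigma_X A^T$ with the sum of all elements of the matrix and with the right hand side of (3.28).

```julia
using LinearAlgebra

# A hypothetical 3x3 covariance matrix (symmetric, positive definite).
SigX = [4.0 1.0 0.5;
        1.0 3.0 0.2;
        0.5 0.2 2.0]

A = [1.0 1.0 1.0]                 # 1 x n matrix of ones

varOfSum = (A * SigX * A')[1, 1]  # Cov(Y) from (3.30), here a 1 x 1 matrix

# Right hand side of (3.28): variances plus twice the covariances with i < j.
n = size(SigX, 1)
rhs = sum(diag(SigX)) + 2*sum(SigX[i, j] for i in 1:n for j in i+1:n)

println("A*SigX*A' = ", varOfSum, ",  sum of all elements = ", sum(SigX),
        ",  right hand side of (3.28) = ", rhs)
```

All three quantities should agree up to floating point rounding.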


The Cholesky Decomposition and Generating Random Vectors 


Say now that you wish to create an $n$ dimensional random vector $Y$ with some specified mean
vector $\mu_Y$ and covariance matrix $\Sigma_Y$. That is, $\mu_Y$ and $\Sigma_Y$ are known.


The formulas in (3.30) yield a potential recipe for such a task if we are given a random vector
$X$ with zero mean and identity covariance matrix ($\Sigma_X = I$). For example, in the context of Monte
Carlo random variable generation, creating such a random vector $X$ is trivial: just generate a
sequence of $n$ i.i.d. normal(0,1) random variables.


Now apply the affine transformation (3.29) on $X$ with $b = \mu_Y$ and a matrix $A$ that satisfies,

$$\Sigma_Y = A A^T. \qquad (3.31)$$

Now (3.30) guarantees that $Y$ has the desired $\mu_Y$ and $\Sigma_Y$.
The question is now how to find a matrix $A$ that satisfies (3.31). For this the Cholesky
decomposition comes as an aid. As an example assume we wish to generate a random vector $Y$
with,

$\mu_Y = \begin{bmatrix} 15 \\ 20 \end{bmatrix}$ and $\Sigma_Y = \begin{bmatrix} 6 & 4 \\ 4 & 9 \end{bmatrix}$.


Listing 3.32 generates random vectors with this mean vector and covariance matrix using three
alternative forms of zero-mean, identity-covariance random vectors. As you can see from
Figure 3.25, such distributions can be very different in nature even though they share the same
first and second order characteristics. The output also presents mean, variance and covariance
estimates of the random vectors generated, showing they agree with the specifications above.
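Before the full listing, it may help to check the factorization on its own. The following minimal
sketch uses only the standard LinearAlgebra library. The covariance matrix is the one from the
example, and the variable names are chosen here for illustration.

```julia
using LinearAlgebra

SigY = [6.0 4.0;
        4.0 9.0]            # covariance matrix of the example
A = cholesky(SigY).L        # lower triangular Cholesky factor

println(A)
println(istril(A))          # the factor is lower triangular
println(A * A' ≈ SigY)      # the factor satisfies (3.31) up to rounding
```

Any matrix A with this property would do in (3.31). The Cholesky factor is simply a convenient
choice that exists whenever the covariance matrix is positive definite.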


Listing 3.32: Generating random vectors with desired mean and covariance


using Distributions, LinearAlgebra, LaTeXStrings, Random, Plots; pyplot()
Random.seed!(1)

N = 10^5

SigY = [ 6 4 ;
         4 9]
muY = [15 ;
       20]
A = cholesky(SigY).L

rngGens = [()->rand(Normal()),
           ()->rand(Uniform(-sqrt(3),sqrt(3))),
           ()->rand(Exponential())-1]

rv(rg) = A*[rg(),rg()] + muY

data = [[rv(r) for _ in 1:N] for r in rngGens]

stats(data) = begin
    data1, data2 = first.(data),last.(data)
    println(round(mean(data1),digits=2), "\t",round(mean(data2),digits=2),"\t",
            round(var(data1),digits=2), "\t", round(var(data2),digits=2), "\t",
            round(cov(data1,data2),digits=2))
end

println("Mean1\tMean2\tVar1\tVar2\tCov")
for d in data
    stats(d)
end

scatter(first.(data[1]), last.(data[1]), c=:blue, ms=1, msw=0, label="Normal")
scatter!(first.(data[2]), last.(data[2]), c=:red, ms=1, msw=0, label="Uniform")
scatter!(first.(data[3]),last.(data[3]),c=:green, ms=1,msw=0,label="Exponential",
    xlims=(0,40), ylims=(0,40), legend=:bottomright, ratio=:equal,
    xlabel=L"X_1", ylabel=L"X_2")














Mean1   Mean2   Var1    Var2    Cov
14.99   19.99   6.01    9.0     4.0
15.0    20.0    6.01    8.96    3.97
15.0    19.98   6.03    8.85    4.01
We define the covariance matrix SigY and the mean vector muY in lines 6-9. In line 10 we use
cholesky() from LinearAlgebra together with .L to compute a lower triangular matrix A that
satisfies (3.31). In lines 12-14 we define an array of functions, rngGens, where each element is a
function that generates a scalar random variable with zero mean and unit variance. The first entry is
a standard normal, the second entry is a uniform on $[-\sqrt{3}, \sqrt{3}]$ and the third entry is a
unit exponential shifted by $-1$. The function we define in line 16, rv(), assumes an input argument
which is a function to generate a random value and then implements the transformation (3.29). In
line 18 we create an array of 3 arrays, with each internal array consisting of N 2-dimensional random
vectors. We then define a function stats() in lines 20-25 which calculates and prints first and second
order statistics. Note the use of begin and end to define the function. The function is then used in
lines 27-30 for printing output. The remainder of the code creates Figure 3.25 using data.
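The claim that the second and third generators have zero mean and unit variance can be verified
directly from the standard formulas for the uniform and exponential distributions. The symbols
$U$, $T$ and $W$ below are introduced only for this check.

```latex
\begin{align*}
U \sim \mathrm{Uniform}(-\sqrt{3}, \sqrt{3}): \quad
  & E[U] = \frac{-\sqrt{3} + \sqrt{3}}{2} = 0, \qquad
    \mathrm{Var}(U) = \frac{(2\sqrt{3})^2}{12} = 1, \\
W = T - 1, \; T \sim \mathrm{Exponential}(1): \quad
  & E[W] = E[T] - 1 = 0, \qquad
    \mathrm{Var}(W) = \mathrm{Var}(T) = 1.
\end{align*}
```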





Bivariate Normal 


One of the most ubiquitous families of multi-variate distributions is the multi-variate normal
distribution. Similarly to the fact that a scalar (univariate) normal distribution is parametrized by
the mean $\mu$ and the variance $\sigma^2$, a multi-variate normal distribution is parametrized by
the mean vector $\mu_X$ and the covariance matrix $\Sigma_X$.


We begin first with the standard multi-variate normal having mean $\mu_X = 0$ and $\Sigma_X = I$.
In this case, the PDF for the random vector $X = (X_1,\ldots,X_n)$ is,

$f(x) = (2\pi)^{-n/2} e^{-\frac{1}{2} x^T x}$. (3.32)

Listing 3.33 illustrates numerically that this is a valid PDF for increasing dimensions. The
example also illustrates how to use numerical integration. The integral (3.24) is carried out. As is
observed from the output, the integral is accurate for dimensions $n = 1,\ldots,8$, after which
accuracy is lost for the given level of computational effort specified (up to $10^7$ function
evaluations).


Listing 3.33: Multidimensional integration


using HCubature

M = 4.5
maxD = 10

f(x) = (2*pi)^(-length(x)/2) * exp(-(1/2)*x'x)

for n in 1:maxD
    a = -M*ones(n)
    b = M*ones(n)
    I,e = hcubature(f, a, b, maxevals = 10^7)
    println("n = $(n), integral = $(I), error (estimate) = $(e)")
end




















n = 1, integral = 0.9999932046537506, error (estimate) = 4.365848932375016e-10
n = 2, integral = 0.9999864091389514, error (estimate) = 1.487907641465839e-8
n = 3, integral = 0.9999796140804286, error (estimate) = 1.4899542976517278e-8
n = 4, integral = 0.9999728074508313, error (estimate) = 4.4447365681340567e-7
n = 5, integral = 0.999965936103044, error (estimate) = 2.3294669134930872e-5
n = 6, integral = 0.9999639124757695, error (estimate) = 0.0003937954462609516
n = 7, integral = 1.0001623151630603, error (estimate) = 0.0031506650163379375
n = 8, integral = 1.0074827348433588, error (estimate) = 0.023275741664597824
n = 9, integral = 1.2233043761463287, error (estimate) = 0.3731125349186617
n = 10, integral = 0.42866209316161175, error (estimate) = 0.22089760603668285








We use the HCubature package. In line 3 we define M. The integration is then performed over a
hypercube with side length twice M, centered at the origin. In line 4 we define maxD as the number of
dimensions up to which we wish to carry out integration. The function definition in line 6 implements
(3.32). We loop over the dimensions in lines 8-13, each time computing the integral in line 11 where
we specify maxevals as the maximum number of function evaluations. The result is a tuple of the
integral value and an error estimate, which are assigned to I and e respectively.
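Since the standard multi-variate normal density factorizes into a product of one dimensional
densities, the integral over the hypercube equals the n'th power of a one dimensional integral.
This yields an independent reference value for each dimension. The sketch below uses a simple
trapezoidal rule written here for the purpose (not part of HCubature) and assumes M = 4.5, a value
consistent with the printed result for n = 1.

```julia
# Standard normal density in one dimension.
phi(x) = exp(-x^2 / 2) / sqrt(2 * pi)

# Composite trapezoidal rule on [a, b] with k subintervals (a simple stand-in
# for a library integrator, adequate for this smooth integrand).
function trapezoid(g, a, b, k)
    h = (b - a) / k
    s = (g(a) + g(b)) / 2
    for i in 1:k-1
        s += g(a + i * h)
    end
    return s * h
end

M = 4.5                            # assumed half-width of the integration box
p1 = trapezoid(phi, -M, M, 10^5)   # probability mass of one coordinate in [-M, M]

for n in 1:10
    println("n = ", n, ", product-form value = ", p1^n)
end
```

Comparing these values with the output of the listing indicates at which dimension the
multi-dimensional integration starts to lose accuracy.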





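Now in general, using an affine transformation like (3.29), it can be shown that for arbitrary
$\mu_X$ and $\Sigma_X$ (positive definite),

$f(x) = |\Sigma_X|^{-1/2} (2\pi)^{-n/2} e^{-\frac{1}{2}(x-\mu_X)^T \Sigma_X^{-1} (x-\mu_X)}$,

where $|\cdot|$ is the determinant. In the case of $n = 2$, this becomes the bivariate normal
distribution with a density represented as,

$f_{XY}(x, y; \sigma_X, \sigma_Y, \mu_X, \mu_Y, \rho) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\Big\{ \frac{-1}{2(1-\rho^2)} \Big[ \frac{(x-\mu_X)^2}{\sigma_X^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2} \Big] \Big\}$.

Here the elements of the mean vector and covariance matrix of the random vector $(X, Y)$ are
spelled out via,

$\mu = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}$ and $\Sigma = \begin{bmatrix} \sigma_X^2 & \sigma_X\sigma_Y\rho \\ \sigma_X\sigma_Y\rho & \sigma_Y^2 \end{bmatrix}$.

Note that $\rho \in (-1, 1)$ is the correlation coefficient as defined in (3.26).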
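The agreement between the general matrix form and the bivariate form can be checked numerically.
The sketch below codes both densities directly from the formulas. The function names mvn_pdf and
bvn_pdf, as well as the parameter values, are hypothetical and used only for this illustration.

```julia
using LinearAlgebra

# General multivariate normal density with mean vector mu and covariance matrix Sigma.
function mvn_pdf(x, mu, Sigma)
    n = length(x)
    d = x - mu
    return det(Sigma)^(-1/2) * (2 * pi)^(-n/2) * exp(-0.5 * dot(d, Sigma \ d))
end

# Bivariate density written out in terms of the five scalar parameters.
function bvn_pdf(x, y, sigX, sigY, muX, muY, rho)
    zx, zy = (x - muX) / sigX, (y - muY) / sigY
    q = (zx^2 - 2 * rho * zx * zy + zy^2) / (1 - rho^2)
    return exp(-q / 2) / (2 * pi * sigX * sigY * sqrt(1 - rho^2))
end

# Hypothetical parameter values for illustration.
sigX, sigY, muX, muY, rho = 2.0, 3.0, 1.0, -1.0, 0.4
Sigma = [sigX^2        sigX*sigY*rho;
         sigX*sigY*rho sigY^2]

p_general   = mvn_pdf([0.5, 0.2], [muX, muY], Sigma)
p_bivariate = bvn_pdf(0.5, 0.2, sigX, sigY, muX, muY, rho)
println(p_general ≈ p_bivariate)
```

In practice one would typically use pdf() with an MvNormal object from Distributions. The explicit
formulas are shown here only to connect the two expressions.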


Elsewhere in the book we fit the five parameters of a bivariate normal to weather data and keep the
results as assignment statements to meanVect and covMat in the file mvParams.jl. The example
below illustrates a plot of random vectors generated from a distribution matching these parameters.


Here we use the MvNormal() constructor from Distributions to create a multi-variate normal
distribution object. The listing generates Figure 3.26.
Listing 3.34: Bivariate normal data


using Distributions, Plots; pyplot()

include("../data/mvParams.jl")
biNorm = MvNormal(meanVect, covMat)

N = 10^3
points = rand(MvNormal(meanVect, covMat), N)

support = 15:0.5:40
z = [ pdf(biNorm,[x,y]) for y in support, x in support ]

p1 = scatter(points[1,:], points[2,:], ms=0.5, c=:black, legend=:none)
p1 = contour!(support, support, z,
        levels=[0.001, 0.005, 0.02], c=[:blue, :red, :green],
        xlims=(15,40), ylims=(15,40), ratio=:equal, legend=:none,
        xlabel="x", ylabel="y")
p2 = surface(support, support, z, lw=0.1, c=cgrad([:blue, :red]),
        legend=:none, xlabel="x", ylabel="y", camera=(-35,20))

plot(p1, p2, size=(800, 400))





In line 3 we include another Julia file defining meanVect and covMat. This file is generated by a
listing in another chapter. In line 4 we create an MvNormal distribution object representing the
bivariate distribution. In line 7 we use rand() with a method provided via the Distributions
package to generate random points. The rest of the code plots Figure 3.26. Notice the call to
contour!() in lines 13-16, with specified levels. In lines 17-18 the parameters supplied via camera
are horizontal rotation and vertical rotation in degrees.





Figure 3.26: Contour lines and a surface plot for a bivariate normal
distribution with randomly generated points on the contour plot.


Chapter 4 


Processing and Summarizing Data - 
DRAFT 


In this chapter we introduce methods and techniques for processing and summarizing data. 
In statistics nomenclature, the act of summarizing data is known as descriptive statistics. In data- 
science nomenclature such activities take the names of analytics and dash-boarding, while the process 
of manipulating and pre-processing data is sometimes called data cleansing, or data cleaning. 


The statistical techniques and tools that we introduce include summary statistics and methods 
for data visualization, sometimes called Exploratory Data Analysis (EDA). We introduce several 
Julia tools for this, including the DataFrames package which allows for the storage of datasets 
that contain non-homogeneous data and includes support for missing entries. We also use the 
Statistics and StatsBase packages, which contain useful functions for summarizing data. 


In practice statisticians and data-scientists often collect data in various ways, including experi- 
mental studies, observational studies, longitudinal studies, survey sampling, and data scraping. Then 
to gain insight from the data, one may consider different data configurations such as: 


Single sample: A case where all observations are considered to represent items from a homogeneous
population. The configuration of the data takes the form: $x_1, x_2, \ldots, x_n$.


Single sample over time (time-series): The configuration of the data takes the form:
$x_{t_1}, x_{t_2}, \ldots, x_{t_n}$ with time points $t_1 < t_2 < \ldots < t_n$.


Two samples: Similar to the single sample case, only now there are two populations (x's and y's).
The configuration of the data takes the form: $x_1,\ldots,x_n$ and $y_1,\ldots,y_m$.


Generalizations from two samples to $k$ samples (each of potentially different sample size,
$n_1,\ldots,n_k$).


Observations in pairs (2-tuples): In this case, although similar to the two sample case, each
observation is a tuple of points, $(x, y)$. Hence the configuration of data is
$(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$.


Generalizations from pairs to vectors of observations: $(x_{11},\ldots,x_{1p}), \ldots, (x_{n1},\ldots,x_{np})$.


Other configurations including relationship data (graphs of connections), images, and many more.
This chapter is structured as follows: In Section 4.1 we see how to manipulate tabular data via
data frames in Julia. We then deal with methods of summarizing data including basic
elements of descriptive statistics. We then move on to plotting where in Section 4.3 we present a
variety of methods for plotting single sample data. In Section 4.4 we present plots for comparing
samples. This is followed by plots for multivariate and high-dimensional data and then by
more simplistic business style plots. The chapter closes with a section where we
show several ways of handling files using Julia as well as how to interact with a server side database.


For readers who wish to better understand the concepts of copies and mutability used when working
with data frames, the subsection below provides an optional overview. It can be skipped on a first
reading.


Mutability, References, Shallow Copies and Deep Copies in Julia 


When using any programming language, it is useful to have a basic understanding of how data is 
organized and referenced in memory. For this reason we now briefly overview the differences between 
mutable types, immutable types, reference copying, shallow copies and deep copies in Julia. We also 
introduce the basic programming concepts of ‘call by value’ and ‘call by reference’. This basic 
understanding is important in its own right, however it may also help readers better understand 
certain aspects of Julia’s DataFrames package, described in the sequel.


As a starting point, we review the difference between two mechanisms for passing variables to
functions. Assume you have a variable x, a function f(), and you then execute f(x). One
can envision two general mechanisms by which this can take place. The first is named call by
value and describes a situation where the code implementing f() gets a copy of the variable x.
As f() executes, even if its code appears to modify x, it is actually modifying a local copy. The
second mechanism is named call by reference and describes a situation where f() obtains a memory
reference (or pointer) to x. In such a case, as f() executes, if it modifies x, then it actually modifies
values in the original memory location of x.


In Julia, both mechanisms exist under a unified umbrella called pass by sharing. This means 
that variables are not copied when passed to functions. However, if a value is about to be changed 
within a function then depending on the mutability attribute of its type, either of the mechanisms 
may be employed. If the variable type is immutable then a local copy is made and the behavior 
follows the ‘call by value’ type. However, if the type is mutable then the called function does not 
create a local copy. Instead, it can modify the original variable according to the ‘call by reference’ 
mechanism. Hence the variable’s property, mutable or immutable, determines which function calling 
mechanism is exhibited. 
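One way to view this rule more precisely is through the distinction between rebinding a name and
mutating an object. An assignment such as z = ... inside a function only points the local name z
at a new value, whereas an operation such as z[1] = ... changes the shared object itself. The
sketch below illustrates this for an array argument. The function names rebind() and mutate!()
are hypothetical.

```julia
# Rebinding: the local name z is pointed at a new array; the caller's array is untouched.
function rebind(z)
    z = [0, 0, 0]
    return nothing
end

# Mutation: the array that z refers to is modified, and the caller sees the change.
function mutate!(z)
    z[1] = 0
    return nothing
end

x = [1, 2, 3]
rebind(x)
println("After rebind:  ", x)

mutate!(x)
println("After mutate!: ", x)
```

Hence even for a mutable argument, a plain assignment to the argument name inside the function
leaves the caller's variable unchanged. By convention, Julia functions that mutate their arguments
carry a trailing ! in their name.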


As a general rule, primitive types such as Int64 or Float32 are immutable. The same goes
for composite types defined using the struct keyword. An exception to this is for composite types
that are explicitly defined as mutable struct. Note that the code examples in this book seldom
define types, however many of the types we use from packages are composite types. While not
often used, if you wish to programmatically check if the type of a variable is immutable or not, you
can use the isimmutable() function.
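As a small illustration, the sketch below defines one immutable and one mutable composite type and
queries their mutability. The type names are hypothetical. The check uses ismutable(), which is
available in Julia 1.5 and later and is the counterpart of isimmutable().

```julia
# Hypothetical types used only for this illustration.
struct FixedPoint
    x::Float64
    y::Float64
end

mutable struct MovablePoint
    x::Float64
    y::Float64
end

p = FixedPoint(1.0, 2.0)
q = MovablePoint(1.0, 2.0)

q.x = 10.0                       # allowed: fields of a mutable struct can be reassigned
println(ismutable(p))            # false
println(ismutable(q))            # true
println(ismutable(1), " ", ismutable([1]))
```

Attempting p.x = 10.0 on the immutable instance would instead throw an error.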


Importantly, arrays are mutable. Listing 4.1 implements two different methods for the function
f(). The first method is for Int, a primitive type (immutable), and the second is for Array{Int}
(mutable). It then demonstrates the ‘call by value’ behavior exhibited for the primitive type, while
the ‘call by reference’ behavior is exhibited for the array.


Listing 4.1: Call by value vs. call by reference


f(z::Int) = begin z = 0 end
f(z::Array{Int}) = begin z[1] = 0 end

x = 1
@show typeof(x)
@show isimmutable(x)
println("Before call by value: ", x)
f(x)
println("After call by value: ", x,"\n")

x = [1]
@show typeof(x)
@show isimmutable(x)
println("Before call by reference: ", x)
f(x)
println("After call by reference: ", x)





typeof(x) = Int64 
isimmutable(x) = true 
Before call by value: 1 
After call by value: 1 


typeof(x) = Array{Int64,1}
isimmutable(x) = false 
Before call by reference: [1] 
After call by reference: [0] 





In line 1 we implement a method of f() for integer types. The code z = 0 will operate on a local
copy of z. In line 2 we implement a method of f() for arrays. Here the code z[1] = 0 will modify
the first entry of the input argument z. Lines 4-9 use the first method, passing the variable x into
f(). As can be seen from the output, the operation of the function f() does not modify x. Also
note the use of the @show macro, useful for debugging or understanding code. Lines 11-16 invoke the
method of f() for arrays of integers (this is multiple dispatch). The key point is that f(x) in line 15
modifies the original x from global scope.





Ideally, for performance reasons, the level of actual copying of memory should be kept to a 
minimum. This is the underlying motivation for having a default ‘pass by reference’ mechanism 
when working with arrays, as you can give functions references to huge data arrays without any 
memory duplication. However, this entails some level of danger because function calls may modify 
variables that are passed to them as arguments. For this reason, Julia offers explicit functions for 
creating copies of variables, namely copy() and deepcopy(). The former creates a ‘shallow copy’
of the variable and copies all entries, but does not do it recursively. The latter recursively produces
a copy until a completely independent copy of the variable is created.


We demonstrate the different types of copies and their interaction with mutability in Listing 4.2.
The basic example on which we apply a deep copy is a doubly nested array, e.g. [[10]]. In
this case, copy() does not copy the inner array [10], whereas deepcopy()
recursively copies all mutable entries.
Listing 4.2: Deep copy and shallow copy


println("Immutable:")
a = 10
b = a
b = 20
@show a

println("\nNo copy:")
a = [10]
b = a
b[1] = 20
@show a

println("\nCopy:")
a = [10]
b = copy(a)
b[1] = 20
@show a

println("\nShallow copy:")
a = [[10]]
b = copy(a)
b[1][1] = 20
@show a

println("\nDeep copy:")
a = [[10]]
b = deepcopy(a)
b[1][1] = 20
@show a;





Immutable: 
a = 10 


No copy: 
a = [20] 


Copy: 
a = [10] 





Shallow copy: 
a = Array{Int64,1}[[20]] 


Deep copy: 
a = Array{Int64,1}[[10]] 





Lines 1-5 exhibit no surprise due to immutability. The Int64 a is assigned to b and b is modified in
line 4. Since integers are immutable, line 4 simply binds b to a new value and a is unaffected.
Lines 7-11 demonstrate different behavior. The array a is mutable and hence after a is assigned to b
in line 9, the modification of b in line 10 also modifies a. Lines 13-17 show a case where a copy() of
a is created. In this case modification of b in line 16 does not alter a. Lines 19-23 are similar,
however in this case the fact that copy() is only a shallow copy matters. The variable b has a new
outer array, however the inner array is still shared with a. Hence the modification in line 22
modifies the inner array of a as well. Finally, in lines 25-29 this is resolved by creating a
deepcopy().
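The sharing of the inner array can also be inspected directly with the identity operator ===, which
tests whether two names refer to the very same object. The following short sketch, with variable
names chosen here for illustration, contrasts copy() and deepcopy() in this way.

```julia
a = [[10]]

b = copy(a)        # new outer array, same inner array object
c = deepcopy(a)    # new outer array and new inner array

println(b === a, " ", b[1] === a[1])
println(c === a, " ", c[1] === a[1])

b[1][1] = 20       # visible through a, since the inner array is shared
println(a, " ", c)
```

Only the pair b[1] and a[1] is identical, which is why the final modification shows up in a but
not in c.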
4.1 Working with Data Frames


In cases where data is homogeneous, arrays, matrices, and tensors are popular ways of organizing 
data. However, more commonly datasets are heterogeneous in nature, or contain incomplete or 
missing entries. In addition, datasets are often large, and commonly require “cleaning”. In such 
cases, more advanced data storage mechanisms are needed. 


The Julia DataFrames package introduces a data storage structure known as a DataFrame,
which is aimed at overcoming these challenges. It can be used to store columns of different types,
and also makes use of the missing value (of type Missing) which, as the name suggests, is used in
place of missing entries. The missing value has an important property, in that it ‘poisons’ other
types it interacts with. For example, if x represents a value, then x + missing evaluates to missing.
This ensures that missing values do not silently ‘infect’ and skew results when operations are
performed on data. For example, if mean() is used on a column with a missing value present, the
result will evaluate as missing. We show ways of dealing with missing values in Listing 4.7.
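The propagation of missing can be tried out without any data frame, since missing is part of the
base language. The values in the following sketch are hypothetical.

```julia
using Statistics

x = 5
println(x + missing)                  # arithmetic with missing yields missing
println(ismissing(x + missing))       # test for missingness with ismissing()
println(missing == missing)           # even == propagates: the result is missing
println(isequal(missing, missing))    # isequal() returns a Bool instead

v = [1, 2, missing, 4]
println(sum(v))                       # missing
println(mean(v))                      # missing
```

Note in particular that comparing with == is not a reliable test for missing entries, whereas
ismissing() is.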


Data frames are easy to work with. They can be created manually or data can be imported
from a csv or txt file. Columns and rows can be referenced by their position index, name (i.e.
symbol), or according to a set of user-defined rules. We now explore some of their functionality. See
http://juliadata.github.io/DataFrames.jl/stable/ for further documentation.


Introducing the Data Frame 


We now introduce data frames through the exploration and formatting of an example dataset. 
The dataset has four fields: Name, Date, Grade and Price. In addition, as is often the case
with real datasets, there are missing values present. Therefore, before analysis can start, some data 
cleaning must be performed. 


Any variable in a dataset can be classified as either a numerical variable, or categorical variable. 
A numerical variable is a variable in which the location of the measurement on the number line 
is meaningful. Examples include height and weight. A categorical variable communicates some 
information based on categories, or characteristics via grouping. Categorical variables can be further 
split into nominal variables, such as blood type, or names, and ordinal variables, in which some order 
is communicated, such as grades on a test, A to E. In our example, Price is a numerical variable, 
while Name is a nominal categorical variable. Since, in our example, Grade can be thought of as a 
rating (A being best, and E being worst) it is an ordinal categorical variable. 


Having covered types of variables, we begin a step by step example of using data frames. In
Listing 4.3 we load the data from the file purchaseData.csv into a data frame and inspect its
contents. Often comma separated values files (csv files) contain a header row which gives details of
each column. However in other cases, no header row appears. In our case, the file’s first row is a
header row and it contains the names of the columns. Hence the first row of the file is:


Name, Date, Grade, Price 
Listing 4.3: Creating and inspecting a DataFrame


using DataFrames, CSV
data = CSV.read("../data/purchaseData.csv", copycols = true)

println(size(data),"\n")
println(names(data),"\n")
println(first(data, 6),"\n")
println(describe(data),"\n")





(200, 4)

Symbol[:Name, :Date, :Grade, :Price]

6x4 DataFrame
| Row | Name     | Date       | Grade   | Price   |
|     | String   | String     | String  | Int64   |
| 1   | MARYANNA | 14/09/2008 | A       | 79700   |
| 2   | REBECCA  | 11/03/2008 | B       | missing |
| 3   | ASHELY   | 5/08/2008  | E       | 24311   |
| 4   | KHADIJAH | 2/09/2008  | missing | 38904   |
| 5   | TANJA    | 1/12/2008  | C       | 47052   |
| 6   | JUDIE    | 17/05/2008 | D       | 34365   |

4x8 DataFrame
| Row | variable | mean    | min       | median  | max       | nunique | nmissing | eltype                 |
|     | Symbol   | Union   | Any       | Union   | Any       | Union   | Int64    | Union                  |
| 1   | Name     |         | ABBEY     |         | ZACHARY   | 182     | 17       | Union{Missing, String} |
| 2   | Date     |         | 1/07/2008 |         | 9/10/2008 | 141     | 4        | Union{Missing, String} |
| 3   | Grade    |         | A         |         | E         | 5       | 13       | Union{Missing, String} |
| 4   | Price    | 39702.0 | 8257      | 38045.5 | 79893     |         | 14       | Union{Missing, Int64}  |








In line 1 we specify use of the DataFrames package, which allows us to use DataFrame type objects.
We also use the CSV package for reading csv files. In line 2 CSV.read() is used to create a data
frame object, populated with data from the file specified. Note that our file has a header row, however
in cases where there isn’t a header use header = false. We use copycols = true to create a
data frame with mutable columns (the default is false). If the default was used, each column would
be of the read-only CSV.Column type. In line 4 the size() function is used to return the number of
rows and columns of the data frame as a tuple. Two other useful functions not shown here are nrow()
and ncol(), which return the number of rows and number of columns respectively. In line 5 the
names() function is used to return an array of all column names as symbols. In line 6 the first()
function is used to display the first six lines of the data frame, as specified by the second argument.
Note that last() can be used to display the last several rows instead. In line 7 describe() is used
to create a data frame with a summary of the data in each column of the input data frame (data in
our case). On inspection, by looking at the nmissing column, one can see there are missing values
present, and we return to this problem in Listing 4.7.
Referencing Data


We now look at ways in which entries within a data frame can be referenced. Individual entries
can be referenced by both row and column index. Columns can be referenced by their name
represented as a symbol, or by their index. Multiple rows or multiple columns can be referenced
via a collection of symbols or indices. We demonstrate several aspects of this in Listing 4.4 below.


Listing 4.4: Referencing data in a DataFrame


using DataFrames, CSV
data = CSV.read("../data/purchaseData.csv", copycols = true)

println("Grade of person 1: ", data[1, 3],
        ", ", data[1,:Grade],
        ", ", data.Grade[1], "\n")
println(data[[1,2,4], :], "\n")
println(data[13:15, :Name], "\n")
println(data.Name[13:15], "\n")
println(data[13:15, [:Name]])





Grade of person 1: A, A, A

3x4 DataFrame
| Row | Name     | Date       | Grade   | Price   |
|     | String   | String     | String  | Int64   |
| 1   | MARYANNA | 14/09/2008 | A       | 79700   |
| 2   | REBECCA  | 11/03/2008 | B       | missing |
| 3   | KHADIJAH | 2/09/2008  | missing | 38904   |

Union{Missing, String}["SAMMIE", missing, "STACEY"]

Union{Missing, String}["SAMMIE", missing, "STACEY"]

3x1 DataFrame
| Row | Name    |
|     | String  |
| 1   | SAMMIE  |
| 2   | missing |
| 3   | STACEY  |








In lines 4-6 we see different ways of accessing the element from the first row and third column,
labeled :Grade. In line 7 the rows and columns to be extracted are designated by the first and second
arguments, [1,2,4], and ‘:’ respectively. Note that ‘:’ can be used to select either all rows, or
all columns. Line 8 is somewhat similar, but here a unit range is used to select rows 13-15, while
the symbol :Name is used so that only the Name column is extracted. Alternatively, the column
could have been referenced by its index, i.e. data[13:15, 1], or the syntax data.Name[13:15]
could have been used instead. Note that although lines 9 and 10 look similar, there is an important
difference. The code in line 9 creates an array, while that of line 10 creates a data frame object, due
to the extra set of []. If one wanted, additional columns could also be selected by including them in
[], separated by ‘,’.
Modifying Data


In general, entries of a data frame can be updated like entries of a matrix, however in certain
cases care is required. Functions performed on a data frame will return a copy of that data frame.
In other words, no underlying change will be made to the data frame object, but rather a shallow
copy will be made and returned as output. Often one wants to change the values within a data
frame. However, by default the columns of a data frame created via CSV.read() are immutable
(read-only), which means that the values within them cannot be changed. In order to make changes
to a column, the column must first be mutable. One way to do this is by including copycols=true
when a data frame is created from a csv file. This argument has the effect of making all columns
mutable. Another way is by using ! when referencing the rows of a data frame. For example,
df[!, :X] references the underlying data in column :X, while df[:, :X] returns a copy of that
column. In the listing below we show how these approaches work.
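The difference between the two forms of row referencing can be seen on a small data frame created
manually. This is a minimal sketch, assuming the DataFrames package is installed. The data frame
df and its columns are hypothetical.

```julia
using DataFrames

# Hypothetical two-column data frame created manually.
df = DataFrame(X = [1, 2, 3], Y = ["a", "b", "c"])

colRef  = df[!, :X]     # the column vector stored in df (no copy)
colCopy = df[:, :X]     # a freshly allocated copy of the column

colCopy[1] = 100        # does not affect df
println(df.X)

colRef[1] = -1          # modifies df itself
println(df.X)

println(colRef === df.X, " ", colCopy === df.X)
```

Here only the change made through colRef is reflected in df, since colCopy is a separate vector.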


Listing 4.5: Editing and copying a DataFrame


using DataFrames, CSV
data1 = CSV.read("../data/purchaseData.csv")
data2 = CSV.read("../data/purchaseData.csv", copycols=true)

try data1[1, :Name] = "YARDEN" catch; @warn "Cannot: data1 is immutable" end

data2[1, :Name] = "YARDEN"
println("\n", first(data2, 3), "\n")

data1[!, :Price] ./= 1000
rename!(data1, :Price=>Symbol("Price(000's)"))
println(first(data1, 3), "\n")

replace!(data2[!,:Grade], ["E"=>"F", "D"=>"E"]...)
println(first(data2, 3),"\n")





Warning: Cannot: data1 is immutable

3x4 DataFrame
| Row | Name    | Date       | Grade  | Price   |
|     | String  | String     | String | Int64   |
| 1   | YARDEN  | 14/09/2008 | A      | 79700   |
| 2   | REBECCA | 11/03/2008 | B      | missing |
| 3   | ASHELY  | 5/08/2008  | E      | 24311   |

3x4 DataFrame
| Row | Name     | Date       | Grade  | Price(000's) |
|     | String   | String     | String | Float64      |
| 1   | MARYANNA | 14/09/2008 | A      | 79.7         |
| 2   | REBECCA  | 11/03/2008 | B      | missing      |
| 3   | ASHELY   | 5/08/2008  | E      | 24.311       |

3x4 DataFrame
| Row | Name    | Date       | Grade  | Price   |
|     | String  | String     | String | Int64   |
| 1   | YARDEN  | 14/09/2008 | A      | 79700   |
| 2   | REBECCA | 11/03/2008 | B      | missing |
| 3   | ASHELY  | 5/08/2008  | F      | 24311   |
In lines 2-3 two data frames are created. The first, data1, has immutable columns, while data2 has
mutable columns due to the second argument in CSV.read(). In line 5 we try to change the value
of the first row and first column in data1. This is done within the try/catch structure. If an error
occurs within try, the code jumps to catch and continues. Since data1 is immutable, an error is
raised, and so the code after catch runs. Here we use the @warn macro to issue a warning. In
line 7 we try the same change for data2, and since this data frame is mutable, we are able to make
the change. In line 10 we perform division on every row element in the :Price column of data1 by
using ! to reference all rows. By using this syntax, the underlying :Price column data is referenced,
and the column is changed to mutable, which then allows us to make the change. The actual
change is done by combining the broadcast ‘.’ operator with the in-place division operator /=, so
that the division is applied to each row. Note the column type changes from Int64 to Float64. In
line 11 rename!() is used to rename the :Price column as shown, with a pair of values, separated
via =>, given as the second argument. Finally, in line 14, replace!() is used to replace all D and E
entries in the :Grade column with E and F respectively. Note that replace!() accepts the
replacement pairs as separate arguments, hence the use of the ... splat operator on the array of
pairs. Finally note that the order of replacement does not matter, as the replacements are not
applied one after the other sequentially. Again note that ! was used for row referencing.
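The fact that the order of the replacement pairs does not matter can be checked on a plain vector
using replace!() from the base language. The grades in the following sketch are hypothetical.

```julia
grades = ["A", "D", "E", "C", "D"]

# Each entry is matched against the pairs once, so "D" => "E" and "E" => "F"
# do not chain: an original "D" becomes "E" and stays "E".
replace!(grades, "E" => "F", "D" => "E")
println(grades)

# The same call written with a splatted array of pairs, in the opposite order.
grades2 = ["A", "D", "E", "C", "D"]
replace!(grades2, ["D" => "E", "E" => "F"]...)
println(grades2 == grades)
```

Had the replacements been applied sequentially, every D would first have become E and then F in
the second ordering, which is not what happens.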








Copying a Data Frame 


When copying a data frame, the same rules and principles that are relevant for other Julia types
apply. These were discussed at the start of this chapter and demonstrated in Listing 4.2. We now
show how copy() and deepcopy() can be used with data frames in Listing 4.6.


Listing 4.6: Using copy() and deepcopy() with a DataFrame


using DataFrames, CSV
data1 = CSV.read("../data/purchaseData.csv", copycols=true)
println("Original value: ", data1.Name[1],"\n")

data2 = data1
data2.Name[1] = "EMILY"
@show data1.Name[1]

data1 = CSV.read("../data/purchaseData.csv", copycols=true)
data2 = copy(data1)
data2.Name[1] = "EMILY"
@show data1.Name[1]
println()

data1 = DataFrame()
data1.X = [[0,1],[100,101]]
data2 = copy(data1)
data2.X[1][1] = -1
@show data1.X[1][1]

data1 = DataFrame(X = [[0,1],[100,101]])
data2 = deepcopy(data1)
data2.X[1][1] = -1
@show data1.X[1][1];
Original value: MARYANNA

data1.Name[1] = "EMILY"
data1.Name[1] = "MARYANNA"

(data1.X[1])[1] = -1
(data1.X[1])[1] = 0





We first create a data frame from a csv file where data1.Name[1] is the string MARYANNA. Then 
in lines 5-7, setting data2 = data1 simply implies that data2 refers to the same object as data1. 
Hence modifying data2 in line 6 results in a modification of data1. In lines 9-13 we circumvent such 
a situation by using the copy() function. In this case setting the new name into data2, EMILY, does 
not affect data1. However, in other cases a shallow copy isn’t enough for separating data frames. 
This is the case in lines 15-19 where we create a data frame with a column named X comprised of 
arrays. In this case, the copied data frame, data2, still refers to the original entries (arrays), because 
these are mutable and were not copied via copy() in line 17. The consequence is that modifying a 
specific entry of data2 as in line 18 actually modifies data1. This is then circumvented by using 
deepcopy() as in lines 21-24. 
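
The distinction between assignment, copy() and deepcopy() does not depend on the DataFrames package. As a minimal sketch using only base Julia, the nested vector below (hypothetical values chosen to mirror the column X above, with variable names that are our own) exhibits the same three behaviours:

```julia
# A vector of vectors stands in for a column whose entries are mutable arrays.
original = [[0, 1], [100, 101]]

alias   = original            # no copy: both names refer to the same object
shallow = copy(original)      # new outer vector, but the inner vectors are shared
deep    = deepcopy(original)  # outer and inner vectors are all duplicated

shallow[1][1] = -1            # mutates an inner vector shared with original
deep[1][1]    = 42            # affects only the deep copy
shallow[2]    = [7, 8]        # rebinds a slot of the outer copy only

@show alias === original      # identical objects
@show original[1][1]          # changed through the shallow copy
@show original[2]             # untouched by rebinding a slot in the copy
@show deep[1][1]
```

The point of the last two assignments is that a shallow copy protects the container but not the mutable objects stored in it, which is the situation of lines 15-19 of the listing.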








Handling Missing Entries 


We now look more closely at the case when missing values are present in a data frame. As 
discussed at the start of this section, missing ‘poisons’ other types on interaction, and this property 
ensures that missing values do not ‘infect’ and skew results when operations are performed on a 
dataset. The DataFrames package comes with several useful functions for dealing with missing 
entries. The Missings.jl package also provides extra functionality for dealing with missing values. 
In Listing 4.7 below we elaborate on some of the functions useful for dealing with missing values. 


Listing 4.7: Handling missing entries 


using Statistics, DataFrames, CSV
data = CSV.read("../data/purchaseData.csv", copycols=true)

println(mean(data.Price),"\n")
println(mean(skipmissing(data.Price)),"\n")
println(coalesce.(data.Grade, "QQ")[1:4],"\n")
println(first(dropmissing(data, :Price), 4),"\n")
println(sum(ismissing.(data.Name)),"\n")
println(findall(completecases(data))[1:4])


missing

39702.01075268817

["A", "B", "E", "QQ"]

4x4 DataFrame
| Row | Name     | Date       | Grade   | Price |
|     | String   | String     | String  | Int64 |
| 1   | MARYANNA | 14/09/2008 | A       | 79700 |
| 2   | ASHELY   | 5/08/2008  | E       | 24311 |
| 3   | KHADIJAH | 2/09/2008  | missing | 38904 |
| 4   | TANJA    | 1/12/2008  | C       | 47052 |

17

[1, 3, 5, 6]








In line 4 we attempt to calculate the mean of the :Price column of data, however missing is 
returned as this column contains missing values. By comparison, in line 5 skipmissing() is first 
used to obtain an iterator over the entries of the :Price column that are not missing, and 
mean() is then applied to it. In line 6 data.Grade is used to obtain a reference to the :Grade column, and 
then the coalesce() function is broadcast over it to replace all missing values with the string ‘QQ’. The first 
four values are accessed via [1:4] to verify the replacement has occurred. In line 7 dropmissing() 
is used to drop all rows which have missing in the :Price column. If no second argument is given, 
dropmissing() will drop all rows that contain missing. In line 8 ismissing() is used with 
the broadcast operator to check if values in the :Name column are missing. If they are, true 
is returned, else false. Then sum() is used to calculate how many missing entries are present. 
The result, 17, can be verified from the output of Listing 4.3 where we see the number of missing 
entries. In line 9 completecases() is used to check if each row contains fully completed fields, i.e. 
no missing values. If no missing values are present, true is returned, else false, for each row. 
Then findall() is used on this array to return an array of row indexes which have no missing 
values, and to shorten the output, the first four values of this array are printed. 
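
The same functions can be tried without the csv file. The following is a minimal sketch on two small vectors with hypothetical values (the names prices and grades are ours, and the row-wise completeness check is written by hand rather than with completecases(), since no data frame is involved):

```julia
using Statistics

# Hypothetical columns containing missing entries.
prices = [79700, missing, 24311, 38904, missing]
grades = ["A", "B", "E", missing, "C"]

meanAll  = mean(prices)                 # missing propagates through mean()
meanObs  = mean(skipmissing(prices))    # mean over the observed entries only
filled   = coalesce.(grades, "QQ")      # replace each missing grade by "QQ"
nMissing = sum(ismissing.(prices))      # count the missing prices

# Indexes of rows in which neither entry is missing.
complete = findall((.!ismissing.(prices)) .& (.!ismissing.(grades)))

@show meanAll meanObs filled nMissing complete
```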





Reshaping, Joining and Manipulating Data Frames 


When working with data it is not uncommon to want to perform operations such as merges or 
joins between several data sets, or to split or reshape the structure of a data set. The DataFrames 
package makes this easy, as it provides many useful functions to do these types of operations. In 
Listing 4.8 we present brief examples of some of the more useful functions for merging and joining 
data frames. 


Listing 4.8: Reshaping, joining and merging data frames 


using DataFrames, CSV
data = CSV.read("../data/purchaseData.csv", copycols = true)

newCol = DataFrame(Validated=ones(Int, size(data,1)))
newRow = DataFrame([["JOHN","JACK"] [123456,909595]], [:Name,:PhoneNo])
newData = DataFrame(Name=["JOHN", "ASHELY", "MARYANNA"],
                    Job=["Lawyer", "Doctor", "Lawyer"])

data = hcat(data, newCol)
println(first(data, 3),"\n")

data = vcat(data, newRow, cols=:union)
println(last(data, 3),"\n")

data = join(data, newData, on=:Name)
println(data,"\n")

select!(data, [:Name, :Job])
println(data,"\n")

unique!(data, :Job)
println(data)


3x5 DataFrame
| Row | Name     | Date       | Grade  | Price   | Validated |
|     | String   | String     | String | Int64   | Int64     |
| 1   | MARYANNA | 14/09/2008 | A      | 79700   | 1         |
| 2   | REBECCA  | 11/03/2008 | B      | missing | 1         |
| 3   | ASHELY   | 5/08/2008  | E      | 24311   | 1         |

3x6 DataFrame
| Row | Name | Date       | Grade   | Price   | Validated | PhoneNo |
|     | Any  | String     | String  | Int64   | Int64     | Any     |
| 1   | RIVA | 30/12/2008 | E       | 21842   | 1         | missing |
| 2   | JOHN | missing    | missing | missing | missing   | 123456  |
| 3   | JACK | missing    | missing | missing | missing   | 909595  |

3x7 DataFrame
| Row | Name     | Date       | Grade   | Price   | Validated | PhoneNo | Job    |
|     | Any      | String     | String  | Int64   | Int64     | Any     | String |
| 1   | MARYANNA | 14/09/2008 | A       | 79700   | 1         | missing | Lawyer |
| 2   | ASHELY   | 5/08/2008  | E       | 24311   | 1         | missing | Doctor |
| 3   | JOHN     | missing    | missing | missing | missing   | 123456  | Lawyer |

3x2 DataFrame
| Row | Name     | Job    |
|     | Any      | String |
| 1   | MARYANNA | Lawyer |
| 2   | ASHELY   | Doctor |
| 3   | JOHN     | Lawyer |

2x2 DataFrame
| Row | Name     | Job    |
|     | Any      | String |
| 1   | MARYANNA | Lawyer |
| 2   | ASHELY   | Doctor |








In line 2 we create data in the same manner as the previous listings. In lines 4-6 we create three 
separate data frames. The first, newCol, consists of a single column :Validated with the same 
number of rows as data. The second, newRow, consists of two rows with :Name and :PhoneNo 
columns. The third, newData, has three rows and two columns, :Name and :Job. In line 9 hcat() is 
used to horizontally concatenate newCol to data. In line 12 vcat() is used to vertically concatenate 
data and newRow, with the new rows appended to the bottom of the data frame. Note cols=:union 
is used so that all columns from both data frames are kept, and missing entries recorded where 
applicable. Alternatively, :equal or :intersect could have been used, or an array of columns to 
be kept instead. In line 15 join() is used to join data and newData together, based on the :Name 
column. Only rows whose :Name appears in both data frames are kept, as seen in the output. 
Note that join() can be used in several different ways. The functions select!() and 
unique!() are demonstrated in lines 18-22. Another function not shown here is stack(), which 
can be used to stack a data frame from a wide format to a long format. We recommend the reader 
consult the DataFrames manual for further information on each of the functions listed here. 
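
To make the logic of joining on a key column concrete, here is a minimal sketch in base Julia that does not use the DataFrames functions at all. Rows are represented as NamedTuples and the second table as a Dict; all values and the names purchases, jobs, joined and firstPerJob are hypothetical and only illustrate the idea of keeping rows whose key appears on both sides:

```julia
# Hypothetical rows standing in for a small data frame.
purchases = [(Name = "MARYANNA", Price = 79700),
             (Name = "ASHELY",   Price = 24311),
             (Name = "TANJA",    Price = 47052),
             (Name = "RIVA",     Price = 21842)]

# Lookup table from Name to Job, playing the role of a second data frame.
jobs = Dict("JOHN" => "Lawyer", "ASHELY" => "Doctor",
            "MARYANNA" => "Lawyer", "RIVA" => "Lawyer")

# Join on Name: keep only the rows whose key is present on both sides.
matched = filter(row -> haskey(jobs, row.Name), purchases)
joined  = [merge(row, (Job = jobs[row.Name],)) for row in matched]

# Keep two fields only, then the first row for each distinct Job.
selected    = [(Name = r.Name, Job = r.Job) for r in joined]
firstPerJob = unique(r -> r.Job, selected)

foreach(println, firstPerJob)
```

Here TANJA is dropped because no job is recorded for that name, and JOHN is dropped because no purchase is recorded, which is the behaviour seen for the join in Listing 4.8.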





Useful Operations for Data Frames 


We have already covered some of the many useful functions available in the DataFrames 
package, such as replace!() and rename!(), both introduced in Listing 4.5. We now provide insight 
into several more concepts, including sorting, changing a column of strings to Date types, how to 
make a column Categorical (useful when constructing models, as covered in Section 8.4), and finally 
how to split, apply, and combine data all in one via the by() function. Listing 4.9 demonstrates 
these. 





Listing 4.9: Manipulating DataFrame objects 


using DataFrames, CSV, Dates, Statistics
data = dropmissing(CSV.read("../data/purchaseData.csv", copycols=true))

data[!,:Date] = Date.(data[!,:Date], "d/m/y")
println(first(sort(data, :Date), 3), "\n")

println(first(filter(row -> row[:Price] > 50000, data), 3), "\n")

categorical!(data, :Grade)
println(first(data, 3), "\n")

println(
    by(data, :Grade, :Price =>
        x -> ( NumSold=length(x), AvgPrice=mean(x)) )
    )


3x4 DataFrame
| Row | Name      | Date       | Grade  | Price |
|     | String    | Date       | String | Int64 |
| 1   | STEPHEN   | 2008-02-11 | D      | 33155 |
| 2   | JACKELINE | 2008-02-12 | E      | 8257  |
| 3   | ARDELL    | 2008-03-03 | C      | 46911 |

3x4 DataFrame
| Row | Name     | Date       | Grade  | Price |
|     | String   | Date       | String | Int64 |
| 1   | MARYANNA | 2008-09-14 | A      | 79700 |
| 2   | NOE      | 2008-08-15 | A      | 79344 |
| 3   | SAMMIE   | 2008-11-05 | B      | 61730 |

3x4 DataFrame
| Row | Name     | Date       | Grade       | Price |
|     | String   | Date       | Categorical | Int64 |
| 1   | MARYANNA | 2008-09-14 | A           | 79700 |
| 2   | ASHELY   | 2008-08-05 | E           | 24311 |
| 3   | TANJA    | 2008-12-01 | C           | 47052 |

5x3 DataFrame
| Row | Grade       | NumSold | AvgPrice |
|     | Categorical | Int64   | Float64  |
| 1   | A           | 15      | 76606.7  |
| 2   | B           | 19      | 59873.9  |
| 3   | C           | 33      | 45285.8  |
| 4   | D           | 35      | 34656.8  |
| 5   | E           | 51      | 20492.5  |

















In line 2 dropmissing() is used so that all rows with missing entries are excluded. This is done as 
some of the functions here require all values to be non-missing. In line 4 the Date() function from the 
Dates package is applied to every row from the :Date column, converting each entry from a string 
to a Date type, according to the string formatting given as the second argument. In line 5 sort() 
is used to sort by the :Date column. In line 7 filter() is used to return only rows which have 
a :Price greater than 50000. In line 9 the type of the :Grade column is changed to categorical 
via categorical!(). In lines 13-14 the powerful by() function is demonstrated. Here data is 
split according to :Grade. The third argument is where calculations are defined. The columns to 
be referenced in the calculations are put to the left of ‘=>’, in our case only :Price is used. The 
calculations are specified by the anonymous function in line 14. Note that => is used to define a Pair 
and -> is used to define an anonymous function. The anonymous function creates a NamedTuple 
defining two new columns, :NumSold and :AvgPrice. For the first, the total number of each 
:Grade is calculated based on the length, i.e. number of entries in the price column. For the second, 
the average price is calculated via mean(). Note that the by() function can be used in many ways, 
and calculations can be done over data in more than one column. There are also several other related 
functions not touched on here, including mapcols(), which can be used to transform all values in a 
data frame, and aggregate(), which has functionality similar to by(). Further functionality is also 
available via the DataFramesMeta package which provides a macro-based framework to interface 
with data frames, such as via the @linq macro and the |> operator. Consult the documentation for 
further information. 
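
The split, apply and combine steps can also be written out by hand, which may help in seeing what is computed for each group. The following is a minimal sketch in base Julia on hypothetical grades and prices; it is not how by() is implemented, and the names groups and gradeSummary are ours:

```julia
using Statistics

# Hypothetical observations: one grade and one price per sale.
grades = ["A", "B", "A", "C", "B", "A"]
prices = [79700, 61730, 79344, 47052, 58000, 76000]

# Split: collect the prices belonging to each grade.
groups = Dict{String, Vector{Int}}()
for (g, p) in zip(grades, prices)
    push!(get!(groups, g, Int[]), p)
end

# Apply and combine: one NamedTuple per grade, in sorted order of grade.
gradeSummary = [(Grade = g, NumSold = length(groups[g]), AvgPrice = mean(groups[g]))
                for g in sort(collect(keys(groups)))]

foreach(println, gradeSummary)
```

Each NamedTuple plays the role of one row of the data frame returned in Listing 4.9, with the same two computed fields, NumSold and AvgPrice.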


A Cleaning and Imputation Example 


In practice, it is not uncommon for data sets to contain many missing values, or require a certain 
amount of cleaning before one can use the data, see for example [TKNS20]. Furthermore it may not 
always be practical to simply exclude every observation which has a missing value. For example, 
if we were to simply exclude all rows with missing values in purchaseData.csv, then we would 
lose almost 25% of the data set. 


Instead of simply deleting rows, one way of dealing with missing values is to use imputation. 
This involves substituting missing entries with values based on either the data observed, or according 
to some other logic. Various methods can be used to impute missing values, and care must be taken 
when imputing, as it can lead to bias in the data. The exact type of imputation scheme used 
should ideally take into account both the nature of the data and the eventual statistical analysis. 
A comprehensive treatment is in [V12]. 


We now present an example of one way in which one might consider cleaning and imputing a 
data set. In Listing 4.10 below we clean the data and impute missing values. First, we replace all 
missing names with the string ‘QQ’ and all missing dates with the string ‘31/06/2008’. We then 
calculate the average price of each grade based on the data available, and use these averages to 
impute missing entries in both the price and grade columns. 


Listing 4.10: Cleaning and imputing data 


using DataFrames, CSV, Statistics
data = CSV.read("../data/purchaseData.csv")

rowsKeep = .!(ismissing.(data.Grade) .& ismissing.(data.Price))
data = data[rowsKeep, :]

replace!(x -> ismissing(x) ? "QQ" : x, data.Name)
replace!(x -> ismissing(x) ? "31/06/2008" : x, data.Date)

grPr = by(dropmissing(data), :Grade, :Price=>x ->
            AvgPrice = round(mean(x), digits=-3))

d = Dict(grPr[:,1] .=> grPr[:,2])
nearIndx(v, x) = findmin(abs.(v.-x))[2]
for i in 1:nrow(data)
    if ismissing(data[i, :Price])
        data[i, :Price] = d[data[i, :Grade]]
    end
    if ismissing(data[i, :Grade])
        data[i, :Grade] = grPr[nearIndx(grPr[:,2], data[i, :Price]), :Grade]
    end
end

println(first(data, 5), "\n")
println(describe(data))


5x4 DataFrame
| Row | Name     | Date       | Grade  | Price |
|     | String   | String     | String | Int64 |
| 1   | MARYANNA | 14/09/2008 | A      | 79700 |
| 2   | REBECCA  | 11/03/2008 | B      | 60000 |
| 3   | ASHELY   | 5/08/2008  | E      | 24311 |
| 4   | KHADIJAH | 2/09/2008  | D      | 38904 |
| 5   | TANJA    | 1/12/2008  | C      | 47052 |

4x8 DataFrame
| Row | variable | mean    | min       | median  | max       | nunique | nmissing | eltype                 |
|     | Symbol   | Union   | Any       | Union   | Any       | Union   | Int64    | Union                  |
| 1   | Name     |         | ABBEY     |         | ZACHARY   | 183     | 0        | Union{Missing, String} |
| 2   | Date     |         | 1/07/2008 |         | 9/10/2008 | 142     | 0        | Union{Missing, String} |
| 3   | Grade    |         | A         |         | E         | 5       | 0        | Union{Missing, String} |
| 4   | Price    | 40037.9 | 8257      | 38045.5 | 79893     |         | 0        | Union{Missing, Int64}  |























In lines 4-5 we check if there are any rows with missing values in both the :Grade and :Price 
columns, and we remove them if present. First ismissing() is applied element wise over all values 
in each column, .& is then used to evaluate to true if both columns contain missing, and finally the 
preceding .! is used to flip the result, evaluating to true if the row should be kept. In our example 
there are no rows with missing values in both columns, so all rows are kept. In lines 7-8 we replace 
all missing names and dates with the strings "QQ" and "31/06/2008" respectively via replace!(). In lines 
10-11 dropmissing() and by() are used to calculate the mean price of each group, excluding 
rows with missing values. The results are rounded to the nearest thousand (digits = -3) and 
stored as the data frame grPr. In line 13 the dictionary d is created based on the values from grPr, 
with grade the key, and average price the value. In line 14 the nearIndx() function is created. It 
takes a value as input, x, and then finds the index of the nearest value from a given vector of values, 
v. In lines 15-22 we loop over each row in the data frame, and impute missing values in the price 
and grade columns. In lines 16-18 if the price entry is missing, then the grade is used to return the 
corresponding value stored in the dictionary d. Similarly, in lines 19-21 if the grade entry is missing, 
then the nearIndx() function is used to find the index of the closest value in grPr based on the 
price in data, and then missing is replaced by the corresponding grade. In lines 24-25 the first 
several rows of the data frame are printed, along with a summary of the cleaned data frame. At this 
point, no missing values are present. 
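
The imputation rule itself can be isolated from the data frame machinery. Below is a minimal sketch on hypothetical vectors, assuming each row has at most one of the two entries missing. The nearIndx() function is the one from the listing, while the group averages and the names gradeLabels, gradeMeans and avgPrice are made up for illustration:

```julia
# Hypothetical rounded group averages, one per grade.
gradeLabels = ["A", "B", "C"]
gradeMeans  = [77000, 60000, 45000]
avgPrice    = Dict(gradeLabels .=> gradeMeans)

# Index of the entry of v that is closest to x.
nearIndx(v, x) = findmin(abs.(v .- x))[2]

# Hypothetical columns with missing entries.
grades = Union{Missing, String}["A", missing, "C", "B"]
prices = Union{Missing, Int}[79700, 58500, missing, missing]

for i in eachindex(grades)
    if ismissing(prices[i])          # impute the price from the grade average
        prices[i] = avgPrice[grades[i]]
    end
    if ismissing(grades[i])          # impute the grade with the nearest average
        grades[i] = gradeLabels[nearIndx(gradeMeans, prices[i])]
    end
end

@show grades prices
```

A row in which both entries are missing cannot be imputed by this rule, which is why such rows are removed first in lines 4-5 of the listing.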





4.2 Summarizing Data 


Now that we have introduced data frames and methods of data processing we can explore basic 
methods of descriptive statistics to obtain data summaries. We focus on numerical data in a single 
sample, observations in pairs, and observations in vectors. 


Single Sample 


Given a set of observations, $x_1,\ldots,x_n$, we can compute a variety of descriptive statistics. The 
sample mean is often the most basic and informative measure of centrality. It is denoted by $\overline{x}$ and 
is given by, 

$$\overline{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$

It is the arithmetic mean of the observations. However, this term, ‘arithmetic mean’, is not often 
used, unless we want to disambiguate it from the geometric mean or the harmonic mean, each 
respectively calculated via, 

$$\overline{x}_g = \sqrt[n]{\prod_{i=1}^{n} x_i} \qquad \text{and} \qquad \overline{x}_h = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}.$$

These two other Pythagorean means are not as popular in statistics as the arithmetic mean, however 
they are occasionally useful. 


The geometric mean is useful for averaging growth factors. For example if $x_1 = 1.03$, $x_2 = 1.05$ 
and $x_3 = 1.07$ are growth factors, the geometric mean, $\overline{x}_g \approx 1.049873$, is a good summary statistic 
of the ‘average growth factor’. This is because the growth factor obtained by $\overline{x}_g^3$ equals the growth 
factor $x_1 \cdot x_2 \cdot x_3$. Hence if we start with an original base level of say 100 units (e.g. dollars) and 
exhibit growths of $x_1$, $x_2$, and $x_3$ above in three consecutive periods, then after three periods we 
have: 

$$\text{Value after three periods} = 100 \cdot x_1 \cdot x_2 \cdot x_3 = 100 \cdot \overline{x}_g^3 = 115.7205.$$


Here the average growth factor is $\overline{x}_g$. Using the arithmetic mean, $\overline{x} = 1.05$, in such a case would 
yield 115.7625, which is slightly off from the correct value. Hence, in such scenarios, using the 
arithmetic mean to describe ‘average growth’ is not adequate. 


The harmonic mean is useful for averaging rates or speeds. For example assume that you are 
on a brisk hike, walking 5 km up a mountain and then 5 km back down. Say your speed going up 
is $x_1 = 5$ km/h and your speed going down is $x_2 = 10$ km/h. What is your ‘average speed’ for the 
whole journey? You travel up for 1 hour and down for 0.5 hours and hence your total travel time is 
1.5 hours. Hence the average speed is $10/1.5 \approx 6.67$ km/h. This is not the arithmetic mean which 
is 7.5 km/h but rather exactly equals the harmonic mean, $2/(1/5 + 1/10)$. 


Also note that for any dataset of positive observations, $\overline{x}_h \le \overline{x}_g \le \overline{x}$. Here the inequalities become equalities only if all 
observations are equal. For $n = 2$, the second inequality can be obtained by manipulating the basic 
inequality $0 \le (x_1 - x_2)^2$. Then for higher $n$ it can be obtained by induction. The first inequality 
can then be obtained from the second inequality, since the harmonic mean is the reciprocal of the 
arithmetic mean of reciprocals. 
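
The two worked examples above are easy to check numerically. The following is a minimal sketch that computes the means directly from their definitions using only the Statistics standard library; the variable names are ours:

```julia
using Statistics

# Growth factors from the example above.
growth = [1.03, 1.05, 1.07]
n = length(growth)

geoMean   = prod(growth)^(1/n)     # geometric mean
arithMean = mean(growth)           # arithmetic mean

valueGeo   = 100 * geoMean^n       # value after three periods, correct
valueArith = 100 * arithMean^n     # value implied by the arithmetic mean

# Speeds from the hiking example, in km/h.
speeds   = [5.0, 10.0]
harmMean = length(speeds) / sum(1 ./ speeds)   # harmonic mean

@show geoMean valueGeo valueArith harmMean
```

Comparing harmMean, the geometric mean of speeds, and mean(speeds) also gives a quick check of the ordering of the three means on this data.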


A different breed of descriptive statistics is based on order statistics. This term is used to 
describe the sorted sample, and is sometimes denoted by, 

$$x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}.$$

Based on the order statistics we can define a variety of statistics such as the minimum, $x_{(1)}$, the 
maximum, $x_{(n)}$, and the median, which in the case of $n$ being odd is $x_{((n+1)/2)}$ and in the case of $n$ 
being even is the arithmetic mean of $x_{(n/2)}$ and $x_{(n/2+1)}$. Like the sample mean, the median is a 
measure of centrality. It is often preferable due to the fact that it isn’t influenced by very high or 
very low measurements. 


Related statistics are the $\alpha$-quantile, for $\alpha \in [0,1]$, which is effectively $x_{(\widetilde{\alpha n})}$, where $\widetilde{\alpha n}$ denotes 
a rounding of $\alpha n$ to the nearest element of $\{1,\ldots,n\}$, or alternatively an interpolation similar to 
the case of the median with $n$ even. For $\alpha = 0.25$ and $\alpha = 0.75$, these values are known as the 
first quartile and third quartile respectively. Finally the inter quartile range (IQR) is the difference 
between these two quartiles and the range is $x_{(n)} - x_{(1)}$. 


The range and the IQR are measures of dispersion, meaning that the greater their magnitude, 
the more spread in the data. When dealing with measures of dispersion, the most popular and 
useful measure is the sample variance, 

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\,\overline{x}^2}{n-1}.$$

The sample variance is approximately the arithmetic mean of squared deviations from the sample 
mean, but isn’t exactly because we divide by $n-1$ and not $n$ (the latter is sometimes called population 
variance). If all observations are constant then $s^2 = 0$, otherwise $s^2 > 0$, and the bigger it is, the 
more dispersion we have in the data. A related quantity is the sample standard deviation $s$ where 
$s := \sqrt{s^2}$. Also of interest is the standard error $s/\sqrt{n}$. Variances, standard deviations, and standard 
errors play a major role in the chapters that follow. 
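
As a quick numerical check of the two expressions for the sample variance, consider the following minimal sketch on a small hypothetical sample (the data and variable names are ours). It also computes the version that divides by $n$, the standard deviation, and the standard error:

```julia
using Statistics

x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]   # hypothetical observations
n = length(x)
xbar = mean(x)

s2Deviations = sum((x .- xbar).^2) / (n - 1)         # first expression
s2Shortcut   = (sum(x.^2) - n * xbar^2) / (n - 1)    # second expression
popVariance  = sum((x .- xbar).^2) / n               # divides by n instead

s  = sqrt(s2Deviations)   # sample standard deviation
se = s / sqrt(n)          # standard error

@show s2Deviations s2Shortcut popVariance s se
```

Both expressions agree with var(x), while dividing by $n$ agrees with var(x, corrected=false).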


In Julia, functions for these descriptive statistics are implemented in the built-in Statistics 
package, with some additional functionality in the StatsBase package. Listing 4.11 illustrates 
their usage. 


Listing 4.11: Summary statistics 


using CSV, Statistics, StatsBase
data = CSV.read("../data/temperatures.csv")[:,4]

println("Sample Mean: ", mean(data))
println("Harmonic <= Geometric <= Arithmetic ",
            (harmmean(data), geomean(data), mean(data)))
println("Sample Variance: ", var(data))
println("Sample Standard Deviation: ", std(data))
println("Minimum: ", minimum(data))
println("Maximum: ", maximum(data))
println("Median: ", median(data))
println("95th percentile: ", percentile(data, 95))
println("0.95 quantile: ", quantile(data, 0.95))
println("Interquartile range: ", iqr(data),"\n")

summarystats(data)


Sample Mean: 27.1554054054054
Harmonic <= Geometric <= Arithmetic (26.52, 26.84, 27.155)
Sample Variance: 16.12538955837281
Sample Standard Deviation: 4.015643106449178
Minimum: 16.1
Maximum: 37.6
Median: 27.7
95th percentile: 33.0
0.95 quantile: 33.0
Interquartile range: 6.100000000000001

Summary Stats:
Length:         777
Missing Count:  0
Mean:           27.155405
Minimum:        16.100000
1st Quartile:   24.000000
Median:         27.700000
3rd Quartile:   30.100000
Maximum:        37.600000





In line 2 we load the data and select the fourth column. This sets data to be an array of Float64. 
In line 4 we compute and print the sample mean using mean(). We then compare it to the harmonic 
mean and geometric mean, computed via harmmean() and geomean() respectively. In line 7 we 
compute the sample variance using var() and then in line 8 the sample standard deviation via std(). 
In lines 9-14 we compute different statistics associated with order statistics including the min, max, 
median, and quartiles. Finally, in line 16 we use the summarystats() function which yields similar 
output. 
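
To connect these functions back to the definitions based on order statistics, the following minimal sketch computes the median and range by hand from a sorted hypothetical sample and compares them with the built-in functions. The data and variable names are ours. The quartiles are taken from quantile(), which by default interpolates between neighbouring order statistics rather than rounding to one of them:

```julia
using Statistics

x  = [7.0, 2.0, 8.0, 4.0, 5.0, 4.0]   # hypothetical observations
xs = sort(x)                          # the order statistics
n  = length(xs)

# Median by definition: middle value, or mean of the two middle values.
med = iseven(n) ? (xs[div(n, 2)] + xs[div(n, 2) + 1]) / 2 : xs[div(n + 1, 2)]

rangeX = xs[end] - xs[1]              # maximum minus minimum

q1, q3 = quantile(x, [0.25, 0.75])    # first and third quartiles
iqrX   = q3 - q1                      # interquartile range

@show med rangeX q1 q3 iqrX
```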





Observations in Pairs 


When data is configured in the form of pairs, $(x_1,y_1),\ldots,(x_n,y_n)$, we often consider the sample 
covariance, which is given by, 

$$\mathrm{cov}_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{n-1}, \qquad (4.1)$$

where $\overline{x}$ and $\overline{y}$ are the sample means of $(x_1,\ldots,x_n)$ and $(y_1,\ldots,y_n)$ respectively. A positive 
covariance indicates a positive linear relationship meaning that when $x$ is larger than its mean, we 
expect $y$ to be larger than its mean, and similarly when $x$ is small, $y$ is small. A negative covariance 
indicates a negative linear relationship meaning that when $x$ is large then $y$ is small, and when $x$ is 
small then $y$ is large. If the covariance is 0 or near 0, it is an indication that no such relationship 
holds. 


However, like the variance, the covariance is not a normalized quantity and hence depends on 
the units of measurement. For example say we were measuring $x$ and $y$ using kilograms and meters 
respectively and obtain $\mathrm{cov}_{x,y} = 0.003$. Assume that we then decided to modify the data and 
represent $x$ in grams by multiplying the original $x$ values by 1000. From (4.1) you can observe that 
the covariance would change to $\mathrm{cov}_{x,y} = 3$. If one was to naively interpret these numbers, in the 
first case it may appear that there is almost no positive linear relationship, while in the second case 
it appears that a positive linear relationship holds. However, nothing changed in the data except for 
the units of measurement and any relationship existing in the first data set (kilograms vs. meters) 
should also exist in the modified one (grams vs. meters). 


For this reason we define another useful statistic, the sample correlation (coefficient): 

$$\hat{\rho}_{x,y} = \frac{\mathrm{cov}_{x,y}}{s_x \, s_y},$$

where $s_x$ and $s_y$ are the sample standard deviations of the samples $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$ 
respectively. Using the Cauchy-Schwarz inequality, we can show that $\hat{\rho}_{x,y} \in [-1,1]$. The sign of 
$\hat{\rho}_{x,y}$ agrees with the sign of $\mathrm{cov}_{x,y}$, however importantly its magnitude is meaningful. Having $|\hat{\rho}_{x,y}|$ 
near 0 implies little or no linear relationship, while $|\hat{\rho}_{x,y}|$ closer to 1 implies a stronger linear 
relationship, which is either positive or negative depending on the sign of $\hat{\rho}_{x,y}$. Also note that 
if $(x_1,\ldots,x_n) = (y_1,\ldots,y_n)$ then the sample covariance is simply the sample variance. That is 
$\mathrm{cov}_{x,x} = s_x^2$. 


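It is often useful to represent the variances and covariances in the sample covariance matrix as 
follows, 

$$\hat{\Sigma} = \begin{bmatrix} \mathrm{cov}_{x,x} & \mathrm{cov}_{x,y} \\ \mathrm{cov}_{x,y} & \mathrm{cov}_{y,y} \end{bmatrix} = \begin{bmatrix} s_x^2 & \hat{\rho}_{x,y}\, s_x s_y \\ \hat{\rho}_{x,y}\, s_x s_y & s_y^2 \end{bmatrix}. \qquad (4.2)$$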
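
The effect of changing units is easy to see numerically. The following minimal sketch uses hypothetical paired data (the values and variable names are ours), computes the covariance and correlation from their definitions, and then rescales $x$ by 1000:

```julia
using Statistics

# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = length(x)

covXY = sum((x .- mean(x)) .* (y .- mean(y))) / (n - 1)   # sample covariance
rho   = covXY / (std(x) * std(y))                          # sample correlation

xGrams = 1000 .* x                 # same data expressed in different units

# Sample covariance matrix assembled from its elements.
S = [var(x)  covXY;
     covXY   var(y)]

@show covXY cov(xGrams, y)         # the covariance scales by 1000
@show rho cor(xGrams, y)           # the correlation is unchanged
@show S
```

The matrix S agrees with cov([x y]), which treats the columns of its argument as variables.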


In Listing 4.12 we import a weather observation dataset containing pairs of temperature 
observations (see Section 3.7). We then estimate the elements of the covariance matrix, and then store 
the results in the file mvParams.jl. Note that this file is used as input to Listing 3.34 at the end 
of Chapter 3. 


Listing 4.12: Estimating elements of a covariance matrix 


using DataFrames, CSV, Statistics

data = CSV.read("../data/temperatures.csv", copycols=true)
brisT = data.Brisbane
gcT = data.GoldCoast

sigB = std(brisT)
sigG = std(gcT)
covBG = cov(brisT, gcT)

meanVect = [mean(brisT) , mean(gcT)]
covMat = [sigB^2  covBG
          covBG   sigG^2]

outfile = open("../data/mvParams.jl","w")
write(outfile, "meanVect = $meanVect \ncovMat = $covMat")
close(outfile)
print(read("../data/mvParams.jl", String))


meanVect = [27.1554, 26.1638]
covMat = [16.1254 13.047; 13.047 12.3673]


In lines 3-5 we import the data and store the temperatures for Brisbane and Gold Coast as the arrays 
brisT and gcT respectively. In lines 7-8 the standard deviations of our temperature observations are 
calculated, and in line 9 the cov() function is used to estimate the covariance. In line 11 the means 
of our temperatures are calculated and stored as the array meanVect. In lines 12-13 the covariance 
matrix is calculated and assigned to the variable covMat. In lines 15-17 we save meanVect and 
covMat to the new Julia file, mvParams.jl. Note that this file is used as input for our calculations 
in Listing 3.34. First, in line 15 the open() function is used (with the argument w) to create the 
file mvParams.jl in write mode. Note that open() creates an input-output stream, outfile, 
which can then be written to. Then in line 16 write() is used to write to the input-output stream 
outfile. In line 17 the input-output stream outfile is closed. In line 18 the content of the file 
mvParams.jl is printed via the read() and print() functions. 
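
An alternative to the explicit open(), write(), close() sequence is the do-block form of open(), which closes the stream automatically, even if an error occurs while writing. The following minimal sketch writes the same two lines to a file in a temporary directory. The numerical values are copied from the output above, while the temporary path and the names path and contents are assumptions made for illustration:

```julia
# Values as printed in the output of the listing.
meanVect = [27.1554, 26.1638]
covMat   = [16.1254 13.047; 13.047 12.3673]

# Write to a temporary location instead of ../data (an assumption of this sketch).
path = joinpath(mktempdir(), "mvParams.jl")

open(path, "w") do io
    # String interpolation renders the arrays as valid Julia literals.
    write(io, "meanVect = $meanVect \ncovMat = $covMat")
end                                   # the stream is closed here

contents = read(path, String)
print(contents)
```

Since the interpolated arrays are written as Julia literals, the resulting file can later be loaded with include().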





Observations in Vectors 


We now consider data that consists of $n$ vectors. The $i$'th data vector represents a tuple of 
values, $(x_{i1},\ldots,x_{ip})$. In this case, the data can be represented by an $n \times p$ data matrix, $X$, where 
the rows are observations (data vectors) and each column represents a different variable, feature or 
attribute. Such a layout is natural if considering the data as part of a data frame, see Section 4.1. 
However, in other cases, you may see this data matrix transposed such that each observation is a 
column vector of features. 


In summarizing the data matrix $X$, a few basic objects arise. These include the sample mean 
vector, sample standard deviation vector, sample covariance matrix, and the sample correlation 
matrix. We now describe these. 


The sample mean vector is simply a vector of length $p$ where the $j$'th entry, $\overline{x}_j$, is the sample 
mean of $(x_{1j},\ldots,x_{nj})$, based on the $j$'th column of $X$. Similarly the sample standard deviation 
vector has a $j$'th entry, $s_j$, which is the sample standard deviation of $(x_{1j},\ldots,x_{nj})$. 


With these we often standardize (also called normalize) the data by creating a new $n \times p$ matrix 
$Z$, with entries, 

$$z_{ij} = \frac{x_{ij} - \overline{x}_j}{s_j}, \qquad i = 1,\ldots,n, \quad j = 1,\ldots,p, \qquad (4.3)$$

sometimes called z-scores. It can be created via, $Z = (X - \mathbf{1}\overline{x}^T)\,\mathrm{diag}(s)^{-1}$, where $\mathbf{1}$ is a column 
vector of 1's, $\overline{x}$ is the mean vector, $s$ is the standard deviation vector, and $\mathrm{diag}(\cdot)$ creates a diagonal 
matrix from a vector. The standardized data has the attribute that each column, $(z_{1j},\ldots,z_{nj})^T$, 
has a 0 sample mean and a unit standard deviation. Hence first and second order information 
of the $j$'th feature is lost when moving from the data matrix $X$ to the standardized matrix $Z$. 
Nevertheless, relationships between features are still captured in $Z$ and can be easily calculated. 
Most notably, the sample correlation between feature $j_1$ and feature $j_2$, denoted by $\hat{\rho}_{j_1 j_2}$, is simply 
calculated via, 

$$\hat{\rho}_{j_1 j_2} = \frac{\sum_{i=1}^{n} z_{i j_1} z_{i j_2}}{n-1} = \left[ \frac{1}{n-1} Z^T Z \right]_{j_1 j_2}. \qquad (4.4)$$

Here the second expression means taking the $j_1, j_2$ entry from the $p \times p$ sample correlation matrix 
$Z^T Z/(n-1)$. In Julia this can be performed via the cor() function. 


Without resorting to standardization, it is often of interest to calculate the $p \times p$ sample 
covariance matrix $\hat{\Sigma}$ generalizing (4.2). Here the $j_1, j_2$ entry is the covariance between the $j_1$ and $j_2$ 
variables. The matrix can be computed in several ways. For example, 

$$\hat{\Sigma} = \frac{1}{n-1} (X - \mathbf{1}\overline{x}^T)^T (X - \mathbf{1}\overline{x}^T) = \frac{1}{n-1} X^T (I - n^{-1} \mathbf{1}\mathbf{1}^T) X. \qquad (4.5)$$

In the second expression, $I$ is the identity matrix and as before $\mathbf{1}$ is a vector of 1's and hence $\mathbf{1}\mathbf{1}^T$ 
is a matrix of 1's. Note that $(I - n^{-1}\mathbf{1}\mathbf{1}^T) X$ is the de-meaned data. Also note that the symmetric 
matrix $(I - n^{-1}\mathbf{1}\mathbf{1}^T)$ multiplied by itself is itself. In Julia, this calculation can be performed via the 
cov() function. We now illustrate several alternative ways for computing the sample covariance 
and sample correlation in Listing 4.13 below. In addition to the cov(), cor(), mean(), and 
std() functions, the listing also illustrates the use of the zscore() function from StatsBase. 


Listing 4.13: Sample covariance 


using Statistics, StatsBase, LinearAlgebra, DataFrames, CSV
df = CSV.read("../data/3featureData.csv",header=false)
n, p = size(df)
println("Number of features: ", p)
println("Number of observations: ", n)
X = convert(Array{Float64,2},df)
println("Dimensions of data matrix: ", size(X))

xbarA = (1/n)*X'*ones(n)
xbarB = [mean(X[:,i]) for i in 1:p]
xbarC = sum(X,dims=1)/n
println("\nAlternative calculations of (sample) mean vector: ")
@show(xbarA), @show(xbarB), @show(xbarC)

Y = (I-ones(n,n)/n)*X
println("Y is the de-meaned data: ", mean(Y,dims=1))

covA = (X .- xbarA')'*(X .- xbarA')/(n-1)
covB = Y'*Y/(n-1)
covC = [cov(X[:,i],X[:,j]) for i in 1:p, j in 1:p]
covD = [cor(X[:,i],X[:,j])*std(X[:,i])*std(X[:,j]) for i in 1:p, j in 1:p]
covE = cov(X)
println("\nAlternative calculations of (sample) covariance matrix: ")
@show(covA), @show(covB), @show(covC), @show(covD), @show(covE)

ZmatA = [(X[i,j] - mean(X[:,j]))/std(X[:,j]) for i in 1:n, j in 1:p]
ZmatB = hcat([zscore(X[:,j]) for j in 1:p]...)
println("\nAlternate computation of Z-scores yields same matrix: ",
            maximum(norm(ZmatA-ZmatB)))
Z = ZmatA

corA = covA ./ [std(X[:,i])*std(X[:,j]) for i in 1:p, j in 1:p]
corB = covA ./ (std(X,dims = 1)'*std(X,dims = 1))
corC = [cor(X[:,i],X[:,j]) for i in 1:p, j in 1:p]
corD = Z'*Z/(n-1)
corE = cov(Z)
corF = cor(X)
println("\nAlternative calculations of (sample) correlation matrix: ")
@show(corA), @show(corB), @show(corC), @show(corD), @show(corE), @show(corF);


Number of features: 3
Number of observations: 7
Dimensions of data matrix: (7, 3)

Alternative calculations of (sample) mean vector:
xbarA = [1.05714, 2.08571, 3.5]
xbarB = [1.05714, 2.08571, 3.5]
xbarC = [1.05714, 2.08571, 3.5]
Y is the de-meaned data: [6.74064e-17 3.24889e-16 2.85486e-16]

Alternative calculations of (sample) covariance matrix:
covA = [0.119524 -0.087381 0.44; -0.087381 0.121429 -0.715; 0.44 -0.715 8.03333]
covB = [0.119524 -0.087381 0.44; -0.087381 0.121429 -0.715; 0.44 -0.715 8.03333]
covC = [0.119524 -0.087381 0.44; -0.087381 0.121429 -0.715; 0.44 -0.715 8.03333]
covD = [0.119524 -0.087381 0.44; -0.087381 0.121429 -0.715; 0.44 -0.715 8.03333]
covE = [0.119524 -0.087381 0.44; -0.087381 0.121429 -0.715; 0.44 -0.715 8.03333]

Alternate computation of Z-scores yields same matrix: 2.220446049250313e-16

Alternative calculations of (sample) correlation matrix:
corA = [1.0 -0.725319 0.449032; -0.725319 1.0 -0.723932; 0.449032 -0.723932 1.0]
corB = [1.0 -0.725319 0.449032; -0.725319 1.0 -0.723932; 0.449032 -0.723932 1.0]
corC = [1.0 -0.725319 0.449032; -0.725319 1.0 -0.723932; 0.449032 -0.723932 1.0]
corD = [1.0 -0.725319 0.449032; -0.725319 1.0 -0.723932; 0.449032 -0.723932 1.0]
corE = [1.0 -0.725319 0.449032; -0.725319 1.0 -0.723932; 0.449032 -0.723932 1.0]
corF = [1.0 -0.725319 0.449032; -0.725319 1.0 -0.723932; 0.449032 -0.723932 1.0]
In line 2 we read the data with header=false since there isn't a line in the CSV file for the variable
(or feature) names. In line 3 we use the size() function to set the number of observations, n, and the
number of features, p. The convert() function is used in line 6 to extract a data matrix out of
the data frame df. Lines 9-13 show alternative ways of computing the sample mean vector. Note
the use of dims=1 in the sum() function in line 11, indicating that each column is summed. In line 15 we
create the de-meaned data Y, and show in line 16 that its mean is 0 (effectively 0 in the output).
Lines 18-24 illustrate a variety of ways to calculate the sample covariance matrix using several forms
of (4.5). Lines 26-30 deal with standardized data as in (4.3). The printout of the maximum of the
norm in line 29 is a way of seeing that the two matrices ZmatA and ZmatB are identical. Finally,
lines 32-39 compute the correlation matrix in a variety of ways. Observe line 35 implementing (4.4).
Also observe that the covariance of Z is the correlation of X as shown in lines 36-37.
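
The matrix identities used in the listing can also be checked without the CSV file. The following is a minimal sketch, assuming a small made-up data matrix (the values are illustrative and are not the contents of 3featureData.csv). The names covManual and corManual are introduced here only for the comparison with the built-in cov() and cor().

```julia
using Statistics, LinearAlgebra

# Illustrative 4x2 data matrix (made-up values, not the book's CSV data)
X = [1.0 2.0;
     2.0 1.5;
     3.0 3.5;
     4.0 3.0]
n, p = size(X)

xbar = vec(mean(X, dims=1))          # sample mean vector of length p
Y = (I - ones(n, n)/n) * X           # de-meaned data
covManual = Y' * Y / (n - 1)         # sample covariance matrix

# Standardize each column; the covariance of Z is then the correlation of X
Z = (X .- xbar') ./ std(X, dims=1)
corManual = Z' * Z / (n - 1)

println(isapprox(covManual, cov(X)))   # true
println(isapprox(corManual, cor(X)))   # true
```

Both comparisons use isapprox() rather than == because the two computation routes differ by floating point rounding, just as the de-meaned column means in the output above are of order 1e-16 rather than exactly 0.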





4.3 Plots for Single Samples and Time Series 


In this section we deal with plots focused on a single collection of observations (numbers),
$x_1,\ldots,x_n$, where in certain cases we plot several such single collections jointly for comparison. If
the observations are obtained by randomly sampling a population, then the order of the observations
is inconsequential. In this case we say the data is a single sample and in general, plotting
the observations one after the other isn't particularly useful. However, if the observations represent
measurements over time then we call the dataset a time-series, and in this case plotting the
observations one after the other is the standard way of considering temporal patterns in the data.

Figure 4.1: A manually created histogram with the same number of bins as
histogram(). Both are compared to original PDF.


Histograms 


In both the single sample and time series case, considering frequencies of occurrences is generally
an insightful way to visualize the data. The most standard mechanism for plotting frequencies is
the histogram, already used extensively in previous chapters (see for example Listing [.11]).
Mathematically, a histogram can be defined as follows. First denote the support of the observations via
$[\ell, m]$ where $\ell$ is the minimal observation and $m$ is the maximal observation. Then the interval
$[\ell, m]$ is partitioned into a finite set of bins $B_1,\ldots,B_L$, and the frequency in each bin is recorded via,

$$ f_j = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{x_i \in B_j\} \qquad \text{for } j=1,\ldots,L. \tag{4.6} $$

Here $\mathbf{1}\{\cdot\}$ is 1 if $x_i \in B_j$ or 0 if $x_i \notin B_j$. We have that $\sum_{j=1}^{L} f_j = 1$ and hence
$f_1,\ldots,f_L$ is a discrete probability distribution.


A histogram is then just a visual representation of this discrete probability distribution (the
frequencies). One way to plot the frequencies is via a stem plot (see for example Figure
illustrating a binomial distribution). However such a plot would not represent the widths of the
bins. Hence an alternative representation is via a histogram function $h(x)$ which is a scaled plot
of the frequencies $f_1,\ldots,f_L$. The function $h(x)$ is defined for any $x \in [\ell, m]$. It is constructed by
staying constant on all values $x \in B_j$, at a height of $f_j/|B_j|$ where $|B_j|$ is the width of bin $j$. This
ensures the total area under the plot is 1. Hence $h(x)$ is actually a probability density function,
and can therefore be compared to probability densities. Mathematically,

$$ h(x) = \sum_{j=1}^{L} \frac{f_j}{|B_j|}\,\mathbf{1}\{x \in B_j\} = \frac{f_{b(x)}}{|B_{b(x)}|}, \tag{4.7} $$

where $b(x)$ is the bin of $x$, that is $x \in B_{b(x)}$.
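
To make (4.6) and (4.7) concrete, here is a minimal sketch that bins an illustrative standard normal sample into equal-width bins and checks the two normalization properties. The choice of L = 10 bins and the names delta, binIndex, counts and heights are assumptions made for this example only.

```julia
using Random

Random.seed!(1)
data = randn(500)                       # illustrative sample
n = length(data)
l, m = minimum(data), maximum(data)

L = 10                                  # number of bins (an arbitrary choice)
delta = (m - l) / L                     # common bin width |B_j|

# Index b(x) of the bin containing x; the maximum is placed in the last bin
binIndex(x) = min(floor(Int, (x - l) / delta) + 1, L)

counts = zeros(Int, L)
for x in data
    counts[binIndex(x)] += 1
end

f = counts ./ n                         # frequencies f_j as in (4.6)
heights = f ./ delta                    # values of h(x) on each bin as in (4.7)

println("Sum of frequencies: ", sum(f))
println("Area under h:       ", sum(heights .* delta))
```

Both printed values equal 1 up to floating point rounding. The first reflects that the frequencies form a discrete probability distribution, and the second that h(x) integrates to 1.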


Note that there are a multitude of ways for choosing the number of bins $L$ and the actual bins
$B_1,\ldots,B_L$. Different histogram implementations will use different bin selection heuristics. We don't
discuss these methods here. Throughout this book we use the histogram() function from Plots
and it contains a default bin selection heuristic. In certain cases we specify the number of bins using
the keyword argument, bins, often making a judicious choice based on the appearance for a specific
dataset. It is important to keep in mind that histograms are not unique representations of
the data, because there is not a unique way to choose the bins.


For demonstration purposes we implement a manual histogram in Listing 4.14 and compare it
to histogram() from Plots. The results are in Figure 4.1. The demonstration illustrates that
a histogram is not a unique representation of the data. Both the manually created histogram
and the built-in histogram have L bins, however a different choice of actual bins creates different
histograms. Note that if in line 21, L is replaced with first.(bins), then the two histograms will
agree because they use the exact same bins. Also note that our implementation is not an efficient
one, but rather aims to illustrate the use of the above equations directly. A related classic plot that
we do not survey here is the stem and leaf plot.


Listing 4.14: Creating a manual histogram

using Plots, Distributions, Random; pyplot()
Random.seed!(0)

n = 2000
data = rand(Normal(),n)
l, m = minimum(data), maximum(data)

delta = 0.3;
bins = [(x,x+delta) for x in l:delta:m-delta]
if last(bins)[2] < m
    push!(bins,(last(bins)[2],m))
end
L = length(bins)

inBin(x,j) = first(bins[j]) <= x && x < last(bins[j])
sizeBin(j) = last(bins[j]) - first(bins[j])
f(j) = sum([inBin(x,j) for x in data])/n
h(x) = sum([f(j)/sizeBin(j) * inBin(x,j) for j in 1:length(bins)])

xGrid = -4:0.01:4
histogram(data,normed=true, bins=L,
    label="Built-in histogram",
    c=:blue, la=0, alpha=0.6)
plot!(xGrid,h.(xGrid), lw=3, c=:red, label="Manual histogram",
    xlabel="x", ylabel="Frequency")
plot!(xGrid,pdf.(Normal(),xGrid),label="True PDF",
    lw=3, c=:green, xlims=(-4,4), ylims=(0,0.5))





In lines 4-6 we deal with the data. It is artificially sampled from a standard normal distribution. Lines
8-13 detail our choice of bins. In this case, L is implicitly defined based on the bin width delta. The
statement in line 11 is executed when m-l is not a multiple of delta and adds an additional final
bin (potentially smaller than the rest of the bins). The function inBin() implements $\mathbf{1}\{x \in B_j\}$.
The function sizeBin() implements $|B_j|$. The function f() implements $f_j$ as in (4.6). We then
use these in line 18 to implement $h(x)$ as in the first representation of (4.7). Lines 21-23 plot the
histogram using histogram() where we specify L bins. Lines 24-25 plot our manual implementation
of the histogram via h(). For comparison, we also plot the PDF of the data in lines 26-27.

Figure 4.2: Histogram of the underlying data, and the KDE, as generated
from the density() function in StatsPlots.


Density Plots and Kernel Density Estimation 


A more modern and visually appealing alternative to histograms is the smoothed histogram, also
known as a density plot, often generated via a kernel density estimate. Before we describe and detail
kernel density estimation, let's see how to use it to create a smoothed histogram as in Figure 4.2. The
figure is generated by Listing 4.15 using the density() function from the StatsPlots package.
The plot is compared to a histogram. Its usage is similar to the histogram() or stephist()
functions from Plots, however the result is strikingly different, yielding a smooth curve.


This example and the next are based on synthetic data from a mixture model. Such models are
useful for situations where we sample from populations made up of heterogeneous sub-populations.
Each sub-population has its own probability distribution and these are “mixed” in the process of
sampling. At first a latent (un-observed) random variable determines which sub-population is used,
and then a sample is taken from that sub-population. In terms of random variable generation,
creating a mixture simply involves first randomly selecting which probability distribution is used,
and then generating an observation from it. Also, the probability density function of the mixture is
a convex combination of the probability density functions of each of the sub-populations. That is,
if the $M$ sub-populations have densities $g_1(x),\ldots,g_M(x)$ with weights $p_1,\ldots,p_M$ and $\sum_{i=1}^{M} p_i = 1$,
then the density of the mixture is,

$$ f(x) = \sum_{i=1}^{M} p_i\, g_i(x). $$
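
The two-stage sampling and the convex combination of densities can be sketched using only the standard library. The component parameters below are illustrative assumptions, and the helper names normalPDF, mixPDF and mixRv are introduced here for this sketch; the normal density is written out by hand to keep it self-contained.

```julia
using Random, Statistics

Random.seed!(0)

# Hypothetical two-component normal mixture (parameters chosen for illustration)
mus, sigmas = [10.0, 40.0], [5.0, 12.0]
weights = [0.3, 0.7]                       # p_1, p_2 summing to 1

normalPDF(x, mu, sigma) = exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2pi))

# Mixture density: convex combination of the component densities
mixPDF(x) = sum(weights[i] * normalPDF(x, mus[i], sigmas[i]) for i in 1:2)

# Sampling: first pick the latent component, then sample from it
function mixRv()
    i = rand() <= weights[1] ? 1 : 2
    return mus[i] + sigmas[i] * randn()
end

data = [mixRv() for _ in 1:10^5]
println("Sample mean:      ", mean(data))
println("Theoretical mean: ", sum(weights .* mus))

# Riemann-sum check that the mixture density integrates to (about) 1
dx = 0.01
area = sum(mixPDF.(-100:dx:200)) * dx
println("Integral of mixPDF: ", area)
```

The sample mean should be close to the weighted average of the component means, and the numerical integral should be close to 1, as expected of a probability density.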
Figure 4.3: Left: KDE compared to the actual underlying PDF as well as a
histogram. Right: KDE obtained via several bandwidth settings.


Listing 4.15: Classic vs. smooth histograms

using Random, Distributions, StatsPlots; pyplot()
Random.seed!(0)

mu1, sigma1 = 10, 5
mu2, sigma2 = 40, 12
dist1, dist2 = Normal(mu1,sigma1), Normal(mu2,sigma2)
p = 0.3
mixRv() = (rand() <= p) ? rand(dist1) : rand(dist2)

n = 2000
data = [mixRv() for _ in 1:n]

density(data, c=:blue, label="Density via StatsPlots",
    xlims=(-20,80), ylims=(0,0.035))
stephist!(data, bins=50, c=:black, norm=true,
    label="Histogram", xlabel="x", ylabel = "Density")








Lines 4-8 deal with the mixture random variable. It is a mixture of two normal distributions, each
with parameters as specified in lines 4-5. The mixture places a probability of p = 0.3 of being from
the first distribution and hence a probability of 0.7 of being from the second. Line 8 defines the
function that generates the mixture random variable. It evaluates to rand(dist1) with probability
0.3 and rand(dist2) with probability 0.7. Lines 10-11 generate data samples from this mixture
distribution. Lines 13-14 create the density plot. Lines 15-16 plot a histogram for comparison.





How is a density plot created? The typical way is via kernel density estimation (KDE), which is
a way of fitting a probability density function to data. When we used the density() function from
StatsPlots above, KDE was implicitly invoked. However, in certain cases, we may wish to have
access to the estimated density. For this we use the kde() function from the KernelDensity
package. Let us first explain how kernel density estimation works.


Given a set of observations, $x_1,\ldots,x_n$, the KDE is the function,

$$ \hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h} K\Big(\frac{x-x_i}{h}\Big), \tag{4.8} $$

where $K(\cdot)$ is some specified kernel function and $h > 0$ is the bandwidth parameter. The kernel
function is a function that satisfies the properties of a PDF. A typical example is the Gaussian
kernel,

$$ K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. $$

With such a kernel (or any other) the estimator $\hat{f}(x)$ is a PDF because it is a weighted superposition
of scaled kernel functions centered about each of the observations. Like histograms, KDEs are not
unique as they depend on the type of kernel function used and more importantly on the bandwidth
parameter. A very small bandwidth implies that the density,

$$ \frac{1}{h} K\Big(\frac{x-x_i}{h}\Big), $$

is very concentrated around $x_i$. This in turn implies that the KDE (4.8) is comprised of a superposition
of very concentrated functions, one for each observation. In contrast, a very large bandwidth
implies that the density around each observation has a very wide spread. This will make the
KDE ‘smear’ over a wide range. Hence ideally, the bandwidth is neither too small nor too large. The
right hand plot of Figure 4.3 illustrates KDE with different choices of bandwidth. As can be seen,
when h = 0.5 the KDE appears to have multiple spikes. At the other extreme, when h = 10 the
KDE is very ‘smeared’.
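
The KDE formula (4.8) with the Gaussian kernel can be implemented directly in a few lines. The following is a minimal sketch on an illustrative sample; the function name kdeManual and the bandwidth values are assumptions made for this example, and the sketch is not how the KernelDensity package computes its estimate.

```julia
using Random

Random.seed!(0)
data = randn(200) .* 2 .+ 5            # illustrative sample (not the book's data)

K(u) = exp(-u^2 / 2) / sqrt(2pi)       # Gaussian kernel

# KDE as in (4.8): average of scaled kernels centered at each observation
kdeManual(x, h) = sum(K((x - xi) / h) / h for xi in data) / length(data)

xGrid = -30:0.05:40
dx = step(xGrid)
for h in [0.1, 1.0, 5.0]
    area = sum(kdeManual.(xGrid, h)) * dx          # Riemann sum of the estimate
    println("h = ", h, ": area under KDE is about ", round(area, digits=4))
end
```

For each bandwidth the area is close to 1, in agreement with the statement that the estimator is a PDF. Only the shape of the curve changes with h, not its total mass.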


For any value of $h$, it can be proved under general conditions that if the data is distributed
according to some density $f(\cdot)$, then $\hat{f}(\cdot)$ converges to $f(\cdot)$ when the sample size grows. Nevertheless,
in practice, choosing the bandwidth is a key issue in the application of KDE. A default classic rule
is Silverman's rule, which is based on the sample standard deviation, $s$. The rule is,

$$ h = \Big(\frac{4}{3}\Big)^{1/5} s\, n^{-1/5} \approx 1.06\, s\, n^{-1/5}. $$

There is some theory justifying this $h$ in certain cases, and in other cases more advanced rules
perform better. Listing 4.16 carries out KDE for the same synthetic data as the
previous example. It generates Figure 4.3 where the left plot compares the KDE to the underlying
PDF of the mixture and the right plot presents the effect of changing the bandwidths.
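
As a small numerical illustration of Silverman's rule, the sketch below evaluates the bandwidth for an illustrative sample and confirms that the constant $(4/3)^{1/5}$ rounds to 1.06. The names hExact and hApprox are introduced here, and the sketch makes no claim about which bandwidth rule a particular package uses by default.

```julia
using Random, Statistics

Random.seed!(0)
data = randn(2000) .* 3          # illustrative sample with standard deviation about 3

n = length(data)
s = std(data)

hExact  = (4/3)^(1/5) * s * n^(-1/5)   # Silverman's rule
hApprox = 1.06 * s * n^(-1/5)          # same rule with the rounded constant

println("(4/3)^(1/5) = ", (4/3)^(1/5))
println("hExact = ", hExact, ", hApprox = ", hApprox)
```

Note how slowly the bandwidth shrinks with the sample size: multiplying n by 32 only halves h, since $32^{-1/5} = 1/2$.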




Listing 4.16: Kernel density estimation

using Random, Distributions, KernelDensity, Plots; pyplot()
Random.seed!(0)

mu1, sigma1 = 10, 5
mu2, sigma2 = 40, 12
dist1, dist2 = Normal(mu1,sigma1), Normal(mu2,sigma2)
p = 0.3
mixRv() = (rand() <= p) ? rand(dist1) : rand(dist2)
mixPDF(x) = p*pdf(dist1,x) + (1-p)*pdf(dist2,x)

n = 2000
data = [mixRv() for _ in 1:n]

kdeDist = kde(data)

xGrid = -20:0.1:80
pdfKDE = pdf(kdeDist,xGrid)

plot(xGrid, pdfKDE, c=:blue, label="KDE PDF")
stephist!(data, bins=50, c=:black, normed=true, label="Histogram")
p1 = plot!(xGrid, mixPDF.(xGrid), c=:red, label="Underlying PDF",
    xlims=(-20,80), ylims=(0,0.035), legend=:topleft,
    xlabel="X", ylabel = "Density")

hVals = [0.5,2,10]
kdeS = [kde(data,bandwidth=h) for h in hVals]
plot(xGrid, pdf(kdeS[1],xGrid), c = :green, label= "h=$(hVals[1])")
plot!(xGrid, pdf(kdeS[2],xGrid), c = :blue, label= "h=$(hVals[2])")
p2 = plot!(xGrid, pdf(kdeS[3],xGrid), c = :purple, label= "h=$(hVals[3])",
    xlims=(-20,80), ylims=(0,0.035), legend=:topleft,
    xlabel="X", ylabel = "Density")
plot(p1,p2,size = (800,400))








The first 12 lines are similar to the previous code example with the exception of line 9, which defines the
function mixPDF(), the PDF of the mixture distribution. In line 14 we invoke the function
kde() to generate a KDE type object kdeDist, based on data. The KernelDensity package
supplies methods for the pdf() function that can be applied to UnivariateKDE objects such as
kdeDist. This is used in line 17 to create the array pdfKDE over xGrid. Lines 19-23 plot the KDE,
a histogram of the data, and the actual PDF. These plots make up p1 which is the left hand side of the
figure. The right hand side p2 is created in lines 25-32.











Empirical Cumulative Distribution Function 


While KDE is a useful way to estimate the PDF of the unknown underlying distribution given 
some sample data, the Empirical Cumulative Distribution Function (ECDF) may be viewed as an 
estimate of the underlying CDF. In contrast to histograms and KDEs, ECDFs provide a unique 
representation of the data not dependent on tuning parameters, such as the bins for histograms, or 
the bandwidth and kernel function for KDE. 


The ECDF is a stepped function which, given $n$ data points, increases by $1/n$ at each point.
Mathematically, given the sample, $x_1,\ldots,x_n$, the ECDF is given by,

$$ \hat{F}(t) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{x_i \le t\} \qquad \text{where } \mathbf{1}\{\cdot\} \text{ is the indicator function.} $$

In the case of i.i.d. data from an underlying distribution with CDF $F(\cdot)$, the Glivenko-Cantelli
theorem ensures that the ECDF $\hat{F}(\cdot)$ approaches $F(\cdot)$ as the sample size grows.
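
The definition of the ECDF translates almost verbatim into code. The following minimal sketch uses a tiny made-up sample; the function name Fhat is introduced here and is not the ecdf() function of StatsBase.

```julia
data = [3.1, 0.4, 2.2, 2.2, 5.0]        # illustrative sample, n = 5

# ECDF: fraction of observations less than or equal to t
Fhat(t) = count(x -> x <= t, data) / length(data)

println(Fhat(0.0))    # 0.0, below all observations
println(Fhat(2.2))    # 0.6, three of the five observations are <= 2.2
println(Fhat(10.0))   # 1.0, above all observations
```

The repeated value 2.2 shows what happens with ties: the ECDF jumps by 2/n at that point, that is, by 1/n for each observation located there.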


Constructing an ECDF is possible in Julia through the ecdf() function contained in the
StatsBase package. In Listing 4.17 we use synthetic data from the same mixture distribution
as in the two previous examples. We obtain the ECDF for a sample of size n = 30 and then again
for n = 100. We compare the ECDFs to the underlying actual CDF of the mixture distribution.
See Figure 4.4.


Listing 4.17: Empirical cumulative distribution function

using Random, Distributions, StatsBase, Plots; pyplot()
Random.seed!(0)

mu1, sigma1 = 10, 5
mu2, sigma2 = 40, 12
dist1, dist2 = Normal(mu1,sigma1), Normal(mu2,sigma2)
p = 0.3
mixRv() = (rand() <= p) ? rand(dist1) : rand(dist2)
mixCDF(x) = p*cdf(dist1,x) + (1-p)*cdf(dist2,x)

n = [30, 100]
data1 = [mixRv() for _ in 1:n[1]]
data2 = [mixRv() for _ in 1:n[2]]

empiricalCDF1 = ecdf(data1)
empiricalCDF2 = ecdf(data2)

xGrid = -10:0.1:80
plot(xGrid,empiricalCDF1.(xGrid), c=:blue, label="ECDF with n = $(n[1])")
plot!(xGrid,empiricalCDF2.(xGrid), c=:red, label="ECDF with n = $(n[2])")
plot!(xGrid, mixCDF.(xGrid), c=:black, label="Underlying CDF",
    xlims=(-10,80), ylims=(0,1),
    xlabel="x", ylabel="Probability", legend=:topleft)











The first few lines of the code block are similar to the previous examples using a mixture distribution.
A difference is that in line 9 we define the function mixCDF() which is the CDF of the mixture
distribution. We then generate two samples in lines 12-13, of varying sample sizes. In lines 15-16 we
invoke the ecdf() function from StatsBase. The returned object can then be used as a function,
evaluating $\hat{F}(\cdot)$ at any point. This is done in lines 19-20 where we plot the ECDFs evaluated on
xGrid. Then lines 21-23 plot the actual CDF.





Figure 4.4: The ECDF from a sample compared to the population CDF.


Normal Probability Plot


We now introduce the normal probability plot. This plot can be used to indicate if it is likely
that a data set has come from a population that is normally distributed or not. It works by plotting
the quantiles of the dataset in question against the theoretical quantiles that one would expect if
the sample data came from a normal distribution, and checking if the plot is linear. The normal
probability plot is actually a special case of the more general Q-Q plot, or quantile-quantile plot,
that is described in more detail in the next section.


In order to create a normal probability plot, the data points are first sorted in ascending order,
$x_1,\ldots,x_n$, and the quantile of each data point is calculated. Then $n$ equally-spaced quantiles of
the standard normal distribution are calculated, and each ascending quantile pair is plotted.
If the data comes from a normal distribution then we can expect the normal probability plot to
follow a straight line, otherwise not. An alternative view of the normal probability plot is to think
of the ECDF of the data, plotted with the y-axis stretched according to the inverse of the CDF of
the normal distribution.


In Listing 4.18 we create Figure 4.5 which presents two normal probability plots. It is based
on two synthetic data sets that have a similar mean and a similar variance. The first comes from
a normal distribution and the second from an exponential distribution. As can be seen, the
“non-normality” of the data coming from the exponential distribution is very apparent.


Listing 4.18: Normal probability plot

using Random, Distributions, StatsPlots, Plots; pyplot()
Random.seed!(0)

mu = 20
d1, d2 = Normal(mu,mu), Exponential(mu)

n = 100
data1 = rand(d1,n)
data2 = rand(d2,n)

qqnorm(data1, c=:blue, ms=3, msw=0, label="Normal Data")
qqnorm!(data2, c=:red, ms=3, msw=0, label="Exponential Data",
    xlabel="Normal Theoretical Quantiles",
    ylabel="Data Quantiles", legend=true)

Figure 4.5: Comparing two normal probability plots. One from a normal
population and one from an exponential population.





The distributions for the two synthetic data sets are defined in lines 4-5. You can check that they have
the same theoretical mean and variance by using mean() and var() on d1 and d2. The samples
are then generated in lines 7-9. Lines 11-14 then plot the normal probability plots via the qqnorm()
and qqnorm!() functions from StatsPlots. The second function has a ! in the name, similar to
other plotting functions that add onto an existing plot.





Visualizing Time-Series 


Moving from single sample data to time-series, we now illustrate a basic example. We create 
time-series plots of two time-series together with an associated histogram. Later, we also show 
how a radial plot can help visualize cyclic temporal patterns on the same data. In general, when 
confronted with time series data, simply plotting a histogram of the data can be misleading because 
the frequencies in the histogram can be greatly affected by trends or cyclic patterns in the data. For 
example, the gross domestic product (GDP) of China has risen in the past 20 years from around a 
trillion US dollars (USD) to roughly 12.5 trillion USD. It does not make sense to plot a histogram 
of this data, because it would not capture the distribution of the GDP. 


Nevertheless, in cases where the time-series data appears to be stationary, a histogram is
immediately insightful. Broadly speaking, a stationary sequence is one in which the distributional
law of observations does not depend on the exact time. This means that there isn't an apparent
trend nor a cyclic component. To illustrate these concepts we present Figure 4.6. The top left plot
presents two time-series of temperature data in the adjacent locations of Brisbane and Gold Coast,
Australia. As apparent from the plot, the sequences are clearly non-stationary. This is because of
seasonality. The top right plot shows a zoomed in view of a specific fortnight. The bottom left
plot is a time-series of the differences in temperatures between Brisbane and Gold Coast. On initial
inspection, this time-series appears to be stationary. Hence for the difference we plot a histogram
in the bottom right. The code for generating Figure 4.6 is in Listing 4.19.

Figure 4.6: Simple plots for time-series data. Top left: Temperatures over
time. Top right: Zooming in on a specific fortnight. Bottom left: Differences in
temperature. Bottom right: Histogram of the difference in temperature.


Listing 4.19: Multiple simple plots for a time-series

using DataFrames, CSV, Statistics, Dates, Plots, Measures; pyplot()

data = CSV.read("../data/temperatures.csv")
brisbane = data.Brisbane
goldcoast = data.GoldCoast

diff = brisbane - goldcoast
dates = [Date(
            Year(data.Year[i]),
            Month(data.Month[i]),
            Day(data.Day[i])
        ) for i in 1:nrow(data)]

fortnightRange = 250:263
brisFortnight = brisbane[fortnightRange]
goldFortnight = goldcoast[fortnightRange]

default(xlabel="Time", ylabel="Temperature")
default(label=["Brisbane" "Gold Coast"])

p1 = plot(dates, [brisbane goldcoast],
        c=[:blue :red])
p2 = plot(dates[fortnightRange], [brisFortnight goldFortnight],
        c=[:blue :red], m=(:dot, 5, Plots.stroke(1)))
p3 = plot(dates, diff,
        c=:black, ylabel="Temperature Difference",legend=false)
p4 = histogram(diff, bins=-4:0.5:6,
        ylims=(0,140), legend = false,
        xlabel="Temperature Difference", ylabel="Frequency")
plot(p1,p2,p3,p4, size = (800,500), margin = 5mm)

Figure 4.7: A radial plot of (time) averaged weekly and fortnightly
temperatures for Brisbane in 2015.





In lines 3-5 we read the data and create the arrays brisbane and goldcoast describing the
temperatures in these respective locations. In line 7 we create the array diff made up of temperature
differences. In lines 8-12 we create the array dates which contains Date objects mapped to the days
of temperature measurement. It is constructed based on the Year, Month, and Day columns of the
data frame by using the respective functions from the Dates package. In line 14 we define a range
of days spanning a fortnight, fortnightRange. This is then used to slice that fortnight from the
temperature data into brisFortnight and goldFortnight. In this plotting example we use the
default() function from Plots to set some default arguments for each subplot. This is in lines
18-19. We then create the plots in lines 21-30, overriding the defaults in certain cases.
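
The construction of Date objects from separate year, month, and day values can be tried in isolation. In the following minimal sketch the three arrays are hypothetical stand-ins for the Year, Month, and Day columns of the data frame, since temperatures.csv is not assumed to be available here.

```julia
using Dates

# Hypothetical year/month/day values standing in for data.Year, data.Month, data.Day
years, months, days = [2015, 2015, 2015], [1, 2, 12], [15, 28, 31]

dates = [Date(Year(years[i]), Month(months[i]), Day(days[i])) for i in 1:3]

println(dates)
println(Dates.monthname.(dates))
println(dates[3] - dates[1])     # the difference of two dates is a number of days
```

Because Date objects support arithmetic and comparison, they can be used directly as the horizontal axis of a time-series plot, as in the listing above.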





Radial Plot 


It is often useful to plot time-series data, or cyclic data, on a so-called radial plot. Such a plot
involves plotting data on a polar coordinate system. See Figure 4.7. This plot can be used to help
visualize the nature of a dataset by comparing the distances of each data point radially from the
origin. A variation of the radial plot is the radar plot, which is often used to visualize the levels of
different categorical variables on the one plot.


For our example of a radial plot we use the Brisbane temperature data in 2015, similar to the data
plotted in the previous listing. This time, we present the effect of different forms of smoothing on
the time-series data. For this we carry out a moving average on the data. Roughly, this transforms
the original data sequence $x_1,\ldots,x_n$ into a smoother sequence $\bar{x}_1,\ldots,\bar{x}_n$ via,

$$ \bar{x}_i = \frac{1}{L}\sum_{j=0}^{L-1} x_{i-j}. \tag{4.10} $$

Hence each $\bar{x}_i$ is the average of the $L$ observations $x_{i-L+1},\ldots,x_i$ neighbouring time $i$ in the original
sequence. A critical parameter is the window size $L$ which determines “how much smoothing” is
to be performed. With $L = 1$ no smoothing takes place and as $L$ is increased more smoothing
takes place. There are also minor details that we don't specify here associated with shifting the
smoothing window and with edge effects.
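
A direct implementation of (4.10) fits in one line. The sketch below is an illustration only: the function name movingAverage is introduced here, and one particular edge convention is assumed, namely that the average is returned only for indices $i \ge L$ where a full window is available. It is not the implementation used by the TimeSeries package.

```julia
# Trailing moving average as in (4.10); defined only for i >= L in this sketch
movingAverage(x, L) = [sum(x[i-j] for j in 0:L-1) / L for i in L:length(x)]

x = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
println(movingAverage(x, 1))   # [2.0, 4.0, 6.0, 8.0, 10.0, 12.0], no smoothing
println(movingAverage(x, 3))   # [4.0, 6.0, 8.0, 10.0]
```

With this convention the smoothed sequence is shorter than the original by L - 1 entries, which is one concrete instance of the edge effects mentioned above.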


In Listing 4.20 we use the TimeSeries package to carry out such smoothing, comparing two
window size values, L = 7 (weekly) and L = 14 (fortnightly). This package contains a variety of
more advanced time series manipulation functions. Refer to the documentation for more examples.
The listing then plots the smoothed data in Figure 4.7. Since Brisbane is in the southern hemisphere,
you can observe that September-March temperatures are significantly higher than the temperatures
during April-August.


Listing 4.20: Radial plot

using DataFrames, CSV, Dates, StatsBase, Plots, TimeSeries; pyplot()

data = CSV.read("../data/temperatures.csv",copycols = true)
brisbane = data.Brisbane
dates = [Date(
            Year(data.Year[i]),
            Month(data.Month[i]),
            Day(data.Day[i])
        ) for i in 1:nrow(data)]

window1, window2 = 7, 14
d1 = values(moving(mean,TimeArray(dates,brisbane),window1))
d2 = values(moving(mean,TimeArray(dates,brisbane),window2))

grid = (2pi:-2pi/365:0) .+ pi/2
monthsNames = Dates.monthname.(dates[1:31:365])

plot(grid, d1[1:365],
    c=:blue, proj=:polar, label="Brisbane weekly average temp.")
plot!(grid, d2[1:365],
    xticks=([mod.((11:-1:0).*2pi/12 .+ pi/2,2pi);], monthsNames),
    c=:red, proj=:polar,
    label="Brisbane fortnightly average temp.", legend=:outerbottom)





Lines 3-9 are similar to the previous listing, setting up brisbane as an array of temperature readings
and dates as an array of dates. In line 11 we define window1 and window2 which specify the width
of the moving average smoothing to be performed. Then lines 12-13 use several functions from the
TimeSeries package to perform moving average smoothing. We first create TimeArray objects,
we then use the moving() function with first argument mean, and we then extract the values using the
values() function. The results are in the arrays d1 and d2. In line 15 we specify the polar plotting
grid. Note the use of .+ pi/2 shifting the range by 90 degrees. In line 16 we use the monthname()
function from package Dates to get an array of month names for labels. The radial plots are generated
in lines 18-23 using the argument proj=:polar. Notice the specification of xticks in line 21 where
we broadcast the mod() function with second argument 2pi. This ensures all angles are standardized
to lie in the interval $[0, 2\pi]$.

Figure 4.8: Left: Samples from two beta distributions. Right: A Q-Q plot
comparing the samples.


4.4 Plots for Comparing Two or More Samples 


Having covered plots for single sample data, we now introduce plots that are primarily designed
for comparing two or more samples. As described at the start of this chapter, in the case of two
samples, we denote the data $x_1,\ldots,x_n$ and $y_1,\ldots,y_m$.


Quantile-Quantile (Q-Q) Plot 


The Quantile-Quantile or Q-Q plot checks if the distributional shape of two samples is the same
or not. For this plot we require that the sample sizes are the same. Then the ranked quantiles of
the first sample are plotted against the ranked quantiles of the second sample. While mathematically
a Q-Q plot is a parametric curve, in practice, since sample sizes are finite, the points are plotted
at the observations of the first sample on the horizontal axis. In the case where the samples
have a similar distributional shape, the resulting plot appears like a collection of increasing points
along a straight line. However, in cases where the distributional shape varies, other patterns appear.
Hence, Q-Q plots serve as a mechanism to compare the distributional shapes of two samples.
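
With equal sample sizes, the points of a Q-Q plot are simply the paired order statistics of the two samples. The following minimal sketch builds these pairs for illustrative samples and uses the correlation of the sorted samples as a rough numerical summary of how close the points are to a straight line. The helper qqPairs and the use of correlation as a linearity summary are assumptions of this sketch, not part of qqplot() from StatsPlots.

```julia
using Random, Statistics

Random.seed!(0)
n = 1000
x = randn(n)                 # first sample
y = 3 .* randn(n) .+ 2       # same shape as x, different location and scale
z = rand(n) .^ 3             # a different (skewed) distributional shape

# With equal sample sizes the Q-Q points are the paired order statistics
qqPairs(a, b) = collect(zip(sort(a), sort(b)))

println("First three Q-Q points for x vs y: ", qqPairs(x, y)[1:3])
println("Linearity summary, x vs y: ", cor(sort(x), sort(y)))
println("Linearity summary, x vs z: ", cor(sort(x), sort(z)))
```

The pair x and y differ only by location and scale, so their Q-Q points lie close to a straight line and the summary is near 1. For x and z the shapes differ and the summary is noticeably lower.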


A different variant of Q-Q plots is when the quantiles of a single sample are plotted against the
quantiles of a theoretical distribution. One such plot is the normal probability plot covered in the
previous section. In general, one can create such a plot of a single sample against any theoretical
distribution. Refer to the documentation of qqplot() from StatsPlots for more information.
Another variant is the Probability-Probability or P-P plot. Here, cumulative probabilities are used
on the axes instead of quantiles.


In Listing 4.21 we generate Figure 4.8 which considers two synthetic samples from beta distributions.
The left plot presents histograms and the right a Q-Q plot. You may modify the parameters
in the code and see how this affects the plot.

Listing 4.21: Q-Q Plots

using Random, Distributions, StatsPlots, Plots, Measures; pyplot()
Random.seed!(0)

b1, b2 = 0.5, 2
dist1, dist2 = Beta(b1,b1), Beta(b2,b2)

n = 2000
data1 = rand(dist1,n)
data2 = rand(dist2,n)

stephist(data1, bins=15, label = "beta($b1,$b1)", c = :red, normed = true)
p1 = stephist!(data2, bins=15, label = "beta($b2,$b2)",
        c = :blue, xlabel="x", ylabel="Density",normed = true)

p2 = qqplot(data1, data2, c=:black, ms=1, msw =0,
        xlabel="Quantiles for beta($b1,$b1) sample",
        ylabel="Quantiles for beta($b2,$b2) sample",
        legend=false)

plot(p1, p2, size=(800,400), margin = 5mm)





Lines 4-5 define the distributions of the synthetic data and their parameters. Lines 7-9 create the 
sample data. Lines 11-13 create the histograms. Lines 15-18 call the qqplot() function from 
StatsPlots and create the Q-Q plot. 





Box Plot 


The box plot, also known as a box and whisker plot, is commonly used to visually summarize a 
single sample dataset and to compare two or more such datasets. It displays the first and third 
quartiles along with the median, i.e. the ‘box’, along with calculated upper and lower bounds of 
the data, i.e. the ‘whiskers’, hence the name. The location of the whiskers is typically given by 


$$\text{minimum} = Q_1 - 1.5\,\text{IQR}, \qquad \text{maximum} = Q_3 + 1.5\,\text{IQR},$$


where IQR is the inter-quartile range (see Section 4.2). Observations that lie outside this range are 
called outliers. 
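
The whisker rule can be sketched in a few lines of Julia. The following is a minimal illustration on made-up data, using quantile() from the Statistics standard library. The data and the variable names are our own assumptions, and the default interpolation used by quantile() may differ slightly from the quartile convention of a particular plotting package.

```julia
using Statistics

# Hypothetical data with one unusually large value.
data = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.6, 9.0]

q1, q3 = quantile(data, 0.25), quantile(data, 0.75)
iqr = q3 - q1

# Whisker bounds according to the 1.5 IQR rule.
lowerBound = q1 - 1.5*iqr
upperBound = q3 + 1.5*iqr

# Observations outside the bounds are flagged as outliers.
outliers = filter(x -> x < lowerBound || x > upperBound, data)

println("Q1 = $q1, Q3 = $q3, IQR = $iqr")
println("Whisker bounds: ($lowerBound, $upperBound)")
println("Outliers: $outliers")
```

With this data only the value 9.0 falls outside the bounds, so it is the single observation that would be marked as an outlier.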


In Listing 4.22 we present an example of the box plot, where we compare three datasets. The 
files machine1.csv, machine2.csv, and machine3.csv represent sample measurements of the 
diameter of identical pipes produced by three different machines. The diameters of the pipes vary 
due to imprecision of each machine but also potentially due to variability between the machines. 
Statistical analysis of this example via ANOVA is presented in Chapter 7. The listing 
produces Figure 4.9, and from this figure we can visually compare the three sample populations. 


Figure 4.9: Box plots of pipe diameters associated with machines 1, 2, and 3. 


Listing 4.22: Box plots of data


using CSV, StatsPlots; pyplot()

data1 = CSV.read("../data/machine1.csv", header=false)[:,1]
data2 = CSV.read("../data/machine2.csv", header=false)[:,1]
data3 = CSV.read("../data/machine3.csv", header=false)[:,1]

boxplot([data1,data2,data3], c=[:blue :red :green], label="",
    xticks=([1:1:3;],["1", "2", "3"]), xlabel="Machine type",
    ylabel="Pipe Diameter (mm)")











In lines 3-5 the data files for each of the machines are loaded and the data stored as separate arrays. 
In lines 7-9 the box plot is created via the boxplot() function from the StatsPlots package. 





Violin Plot 


The violin plot is another plot that can be used to compare multiple sample populations. It 
is similar to the box plot, however the shape of each sample is represented by a mirrored kernel 
density estimate of the data. Listing 4.23 creates an example of this plot as shown in Figure 4.10. 
Note this example uses the iris dataset from the RDatasets package. This dataset is further 
explored in the next section. 


Listing 4.23: Violin plot


using RDatasets, StatsPlots

iris = dataset("datasets", "iris")
@df iris violin(:Species, :SepalLength,
    fill=:blue, xlabel="Species", ylabel="Sepal Length", legend=false)


Figure 4.10: An example of a violin plot. 





In line 3 the iris dataset from the RDatasets package is loaded as a DataFrame via the dataset 
function. The first argument, "datasets", is the package in RDatasets which contains the "iris" 
dataset, which is the second argument. In line 4 the @df macro is used to plot the data from the 
dataframe directly, with the first argument :Species the horizontal axis, and the second argument 
:SepalLength the vertical axis. 





4.5 Plots for Multivariate and High Dimensional Data 


We now consider vectors of observations, $(x_{11},\ldots,x_{1p}),\ldots,(x_{n1},\ldots,x_{np})$, 
where $n$ is the number of observations and $p$ is the number of variables, or features. In cases 
where $p$ is large the data is called high dimensional. In such cases, analysis of the data can be 
both challenging and rewarding. Such analysis is the focus of Chapters 8 and 9, where we focus on 
linear regression and machine learning. Analysis of multivariate data often goes hand in hand with 
visualization. Here the natural constraint is the fact that images (or plots) are limited to lie on 
the two dimensional plane, while in practice $p$ is often much greater than 2 (denoted $p \gg 2$). 


We have already explored several basic plots associated with multivariate data. These include the 
surface plot and heat map first introduced in Figure 1.8, the contour plot introduced in Figure 8.26, 
and the scatter plot first introduced in Figure 1.12. We augment these by introducing the scatter 
plot matrix, heat map with marginals plot, and Andrews plot. Also related are plots generated by 
PCA such as that presented in Figure ?? in Chapter 9. 


Note that with the exception of a basic animation example presented in Listing 1.12 in Chapter 1, 
we don't cover advanced animation methods, sound generation, interactive graphics or 3D printing. 
Nevertheless, the reader should keep in mind that when properly used, all these forms of media 
allow one to better visualize high dimensional data. This is still an emerging field and is bound to 
take on a more prominent role in the coming years. To this end, you may be interested in exploring 
a growing and diverse set of Julia packages including Plotly.jl, VegaLite.jl, and others. 


Scatter Plot Matrix 


The basic scatter plot is very common in data visualization. In its simplest form, when 
considering pairs of observations $(x_1,y_1),\ldots,(x_n,y_n)$ it is a plot of these coordinates on 
the Cartesian plane. If in addition, the observations are labelled where each pair $(x_i,y_i)$ has a 
label from a small finite set, then each point can be colored or marked with a symbol, matching the 
label. See for example Figure 3.25 from Chapter 3 as one of many examples of this type of plot. 


When moving from pairs to higher dimensions, each observation is represented as the vector 
or tuple $(x_{i1},\ldots,x_{ip})$. If $p = 3$ one may still try to illustrate a point cloud, however 
for higher dimensions this isn’t possible. In this case, one of the most popular plots for visualizing 
relationships is the scatter plot matrix. It consists of taking each possible pair of variables and 
plotting a scatter plot for that pair. This allows one to understand relationships between pairs of 
variables. With $p$ variables there are $p^2$ total plots, where $p$ of the plots are redundant 
because they plot a variable against itself (on the diagonal), and each of the other $p^2 - p$ plots 
has a duplicate among them (with the axes reversed). Hence for example if $p = 4$ there are 
$(p^2 - p)/2 = 6$ important plots in the scatter plot matrix even though the $4 \times 4$ matrix has 
16 plots in total. 
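
The counting argument can be checked directly by enumerating the pairs of variable indices. The short sketch below is only an illustration of the arithmetic; the variable names are our own.

```julia
p = 4

# All ordered pairs (i, j), corresponding to the full p x p grid of plots.
allPairs = [(i, j) for i in 1:p, j in 1:p]

# Pairs with i < j: one representative for each non-redundant plot.
distinctPairs = [(i, j) for i in 1:p for j in i+1:p]

println("Total plots: ", length(allPairs))
println("Non-redundant plots: ", length(distinctPairs))
println(distinctPairs)
```

Changing p shows how quickly the number of panels grows, which is one reason scatter plot matrices become hard to read for more than a handful of variables.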


As an example, Listing 4.24 creates Figure 4.11, where we consider the iris data set that 
consists of four measurements for each flower: ‘sepal width’, ‘sepal length’, ‘petal width’, and ‘petal 
length’. Hence each flower can be considered as a tuple $(x_{i1}, x_{i2}, x_{i3}, x_{i4})$. As with any 
scatter plot, data in scatter plot matrices can also be colored or labeled. In this example, there are 
3 species, ‘setosa’, ‘versicolor’, and ‘virginica’, and each tuple is associated with a species. The 
listing output also summarizes basic information about the iris dataset. Inspection of Figure 4.11 
can yield insight and conjectures about the population of flowers. 


Listing 4.24: Scatterplot matrix


using RDatasets, Plots, Measures; pyplot()

data = dataset("datasets", "iris")
println("Number of rows: ", nrow(data))

insertSpace(name) = begin
    i = findlast(isuppercase,name)
    name[1:i-1]*" "*name[i:end]
end

featureNames = insertSpace.(string.(names(data)))[1:4]
println("Names of features:\n\t", featureNames)

speciesNames = unique(data.Species)
speciesFreqs = [sn => sum(data.Species .== sn) for sn in speciesNames]
println("Frequency per species:\n\t", speciesFreqs)

default(msw = 0, ms = 3)

scatters = [
    scatter(data[:,i], data[:,j], c=[:blue :red :green], group=data.Species,
        xlabel=featureNames[i], ylabel=featureNames[j], legend = i==1 && j==1)
    for i in 1:4, j in 1:4 ]

plot(scatters..., size=(1200,800), margin = 4mm)


Figure 4.11: A scatter plot matrix of the iris dataset with observations 
grouped by species. Blue is setosa, red is versicolor, green is virginica. 


Number of rows: 150 
Names of features: 

["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", " Species"] 
Frequency per species: 

Pair{String, Int64}["setosa"=>50, "versicolor"=>50, "virginica"=>50] 








In line 3 we create the data frame and in line 4 we print the number of rows in it. In lines 6-9 we 
define a function that takes a string, name, that is assumed to be of the form "SepalWidth" as 
an example. Such are the names of columns in the iris dataset. The function then inserts white 
space prior to the last capital letter so as to convert the string to "Sepal Width". Notice the use of 
string concatenation using * in line 8. We then use this function in line 11 to create featureNames, 
an array of strings that is later used to label the variables. Note the use of names() in line 11, 
yielding an array of symbols. Lines 14-16 deal with the species and their frequency. The names of 
species are obtained in line 14 and their frequency is obtained in line 15. This is simply for purposes 
of summarizing these results in the output generated in line 16. In line 18 we use the default() 
function from Plots to set parameters used by all scatter plots. In line 20 we create a matrix of 
scatter plots. Note the use of group= in line 21 based on species. Also note the condition in line 22 
for presenting a legend only in the top left plot. The plots are then presented in a figure in line 25. 
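
To see what the helper function of lines 6-9 does in isolation, it can be applied to a few strings outside of the plotting code. The sketch below repeats the definition from Listing 4.24 and needs no packages. The example strings are chosen to resemble the column names of the iris dataset.

```julia
# Same definition as in Listing 4.24.
insertSpace(name) = begin
    i = findlast(isuppercase, name)
    name[1:i-1] * " " * name[i:end]
end

println(insertSpace("SepalWidth"))    # Sepal Width
println(insertSpace("PetalLength"))   # Petal Length

# Broadcasting with the dot syntax applies the function to each element.
println(insertSpace.(["SepalLength", "PetalWidth"]))

# A name whose only capital is its first letter gains a leading space.
println(repr(insertSpace("Species")))
```

The last call shows why a name such as "Species" comes out as " Species": the last capital letter is also the first character, so the space is inserted at the very start.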


Heat Map with Marginals 


The heat map, first seen in Figure 1.8, consists of a grid of shaded cells. Another name for it is 
a matrix plot. The colors of the cells indicate the magnitude, where typically, the ‘warmer’ the 
color, the higher the value. This is in a sense nothing but a monochrome image. 


In cases of pairs of observations $(x_1,y_1),\ldots,(x_n,y_n)$, the bivariate data can be constructed 
into a bivariate histogram in a manner similar to the (univariate) histogram implemented in 
Listing 4.14. In the bivariate case, we partition the Cartesian plane $\mathbb{R}^2$ (or the subset 
containing the data) into a grid of bins $B_{ij}$ for $i = 1,\ldots,L_1$ and $j = 1,\ldots,L_2$. Then 
we compute the relative frequency of observations per bin via, 


$$f_{ij} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{1}\{(x_k,y_k) \in B_{ij}\}, \qquad \text{for } i=1,\ldots,L_1, \quad j=1,\ldots,L_2. \tag{4.11}$$


Compare this with (4.6) dealing with the univariate case. Now the $L_1 \times L_2$ matrix composed 
of $f_{ij}$ can be plotted as a heat map to yield a bivariate histogram. 
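
The bin frequencies of (4.11) can be computed by hand as in the following sketch. It is not how marginalhist() is implemented; it only illustrates the counting. The data is synthetic, the bins are assumed to be an equally spaced grid on the unit square, and the helper binIndex() is our own.

```julia
using Random
Random.seed!(0)

# Synthetic pairs (x_k, y_k) in the unit square; purely illustrative.
n = 1000
x, y = rand(n), rand(n)

# Grid of L1 x L2 equally sized bins covering [0,1] x [0,1].
L1, L2 = 4, 5

# Index of the bin containing v when [0,1] is split into L equal bins.
binIndex(v, L) = min(floor(Int, v*L) + 1, L)

# f[i,j] is the proportion of observations falling in bin B_ij.
f = zeros(L1, L2)
for k in 1:n
    f[binIndex(x[k], L1), binIndex(y[k], L2)] += 1/n
end

println("Sum of all frequencies: ", sum(f))
display(round.(f, digits=3))
```

Since every observation falls in exactly one bin, the entries of f sum to one up to floating point error. The matrix f is what a heat map would then display.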


The marginalhist() function from StatsPlots implements this and goes even further 
to create and present marginal histograms. These are two separate univariate histograms, one for 
$x_1,\ldots,x_n$ and the other for $y_1,\ldots,y_n$. Then, as shown in Figure 4.12, these histograms 
are presented on the margins of the heat maps, estimating the marginal distributions. See also 
Section 3.7. 


In Listing 4.25 we create Figure 4.12, which presents two variants of a heat map with marginals. 
The left plot is for the Brisbane and Gold Coast temperature data, also used in several other 
listings. The right plot is for synthetic data based on a bivariate normal distribution fitted to 
that data, with the actual parameters fit in Listing 4.12. Note that this data is also plotted as a 
time-series in Figure 4.6. Hence in interpreting it via a histogram, one needs to exercise caution 
due to the cyclic nature of the data. 


Listing 4.25: Heatmap and marginal histograms


using StatsPlots, Distributions, CSV, DataFrames, Measures; pyplot()

realData = CSV.read("../data/temperatures.csv")

N = 10^5
include("../data/mvParams.jl")
biNorm = MvNormal(meanVect,covMat)
syntheticData = DataFrame(rand(MvNormal(meanVect,covMat),N)')
rename!(syntheticData, [:x1 => :Brisbane, :x2 => :GoldCoast])

default(c=cgrad([:blue, :red]),
    xlabel="Brisbane Temperature",
    ylabel="Gold Coast Temperature")

p1 = marginalhist(realData.Brisbane, realData.GoldCoast, bins=10:45)
p2 = marginalhist(syntheticData.Brisbane, syntheticData.GoldCoast, bins=10:.5:45)

plot(p1,p2, size = (1000,500), margin = 10mm)


Figure 4.12: Heat map with marginals comparing Brisbane and Gold Coast 
temperatures. Left: actual data. Right: synthetic multivariate normal data. 





In line 3 we create realData based on the Brisbane and Gold Coast temperature file. Lines 5-9 
create the syntheticData data frame with N observations based on the bivariate normal distribution 
biNorm using the parameters in mvParams.jl similarly to Listing 3.34. The actual creation of the 
DataFrame object in line 8 creates default column names, x1 and x2. We then rename these in line 9. 
The remainder of the code creates the two heat maps with marginals plots using marginalhist() 
in lines 15-16. Observe that for the synthetic data we are able to use a much larger number of bins. 
Note the use of the cgrad() function in line 11, setting the color gradient as part of the default 
parameters. 





Andrews Plot 


We now introduce a completely different way to visualize high-dimensional data. The idea is to 
represent a data vector $(x_{i1},\ldots,x_{ip})$ via a real valued function. For any individual vector, 
such a transformation cannot be generally useful, however when comparing groups of vectors, it may 
yield a way to visualize structural differences in the data. 


The specific transformation rule that we present here creates a plot known as Andrews plot. 
Here for the $i$'th data vector $(x_{i1},\ldots,x_{ip})$ we create the function $f_i(\cdot)$ defined 
on $[-\pi,\pi]$ via, 

$$f_i(t) = \frac{x_{i1}}{\sqrt{2}} + x_{i2}\sin(t) + x_{i3}\cos(t) + x_{i4}\sin(2t) + x_{i5}\cos(2t) + x_{i6}\sin(3t) + x_{i7}\cos(3t) + \cdots,$$

with the last term involving a sin() if $p$ is even and a cos() if $p$ is odd. Then for 
$i = 1,\ldots,n$, the functions $f_1(\cdot),\ldots,f_n(\cdot)$ are plotted. In cases where each $i$ has 
an associated label from a small finite set, different colors or line patterns can be used. An example 
of this plot is shown in Figure 4.13, where the results hint at similarities within species and 
differences between species. 
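
The transformation can be written as a short function. The sketch below is our own illustration of the standard Andrews formula, x_1/sqrt(2) + x_2 sin(t) + x_3 cos(t) + x_4 sin(2t) + x_5 cos(2t) + ..., and is not taken from StatsPlots, whose implementation may differ in details. The function name andrews() and the example vector are assumptions.

```julia
# Andrews curve of a data vector x = (x_1, ..., x_p), evaluated at t.
function andrews(x, t)
    total = x[1] / sqrt(2)
    for j in 2:length(x)
        k = div(j, 2)    # frequency of term j: 1, 1, 2, 2, 3, 3, ...
        # Even positions use sin(), odd positions use cos().
        total += iseven(j) ? x[j]*sin(k*t) : x[j]*cos(k*t)
    end
    return total
end

# Evaluate the curve of one hypothetical observation (p = 4) on a coarse grid.
tGrid = range(-pi, pi, length=5)
v = [5.1, 3.5, 1.4, 0.2]
println([round(andrews(v, t), digits=3) for t in tGrid])
```

Plotting andrews(v, t) over a fine grid of t values for every observation, with one color per label, would give a plot of the same kind as the one described here.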


In Listing 4.26 below, we present a standard example of Andrews plot based on the iris dataset. 
The resulting Figure 4.13 indicates that differences exist between each of the species. 


Listing 4.26: Andrews plot


using RDatasets, StatsPlots; pyplot()

iris = dataset("datasets", "iris")
@df iris andrewsplot(:Species, cols(1:4),
    line=(fill=[:blue :red :green]), legend=:topleft)








In line 4 the andrewsplot() function from StatsPlots is used to plot the data. Note the @df 
macro is used in a similar format to that of Listing 4.23. The first argument, :Species, determines 
how the data should be grouped, while the second argument determines what variables should be 
included in the calculation, in this case columns 1 to 4. 


Figure 4.14: Two pie charts. 


4.6 Plots for the Board Room 


In this section we introduce simpler plots, such as those that one may typically see in 
business summaries, or news reports. We show how to create pie charts, bar charts, and stack plots. 
Although the plots covered here are not as technical as those covered previously, they are still useful 
as they can quickly convey information in a very clear manner. The examples in this section rely 
on data for three fictitious companies, stored in companyData.csv. 


Pie Chart 


We first look at the pie chart, which is a simple plot that conveys relative proportions. In 
Listing 4.27 we construct two pie charts which show the relative market capitalization of each 
company A, B, and C for the years 2012 and 2016. The results are shown in Figure 4.14. 


Listing 4.27: A pie chart


using CSV, CategoricalArrays, Plots; pyplot()

df = CSV.read("../data/companyData.csv")
companies = levels(df.Type)

year2012 = df[df.Year .== 2012, :MarketCap]
year2016 = df[df.Year .== 2016, :MarketCap]

p1 = pie(companies, year2012, title="2012 Market Cap \n by company")
p2 = pie(companies, year2016, title="2016 Market Cap \n by company")
plot(p1, p2, size=(800, 400))





In line 4 levels() from the CategoricalArrays package is used to extract the name of each 
company as a level, and store them in the array companies. In lines 6-7 the market capitalization for 
each company is stored as arrays year2012 and year2016 for the years 2012 and 2016 respectively. 
In lines 9-10 the pie() function is used to create the pie charts. 


Figure 4.15: A stacked bar plot (left), and non-stacked bar plot (right). 


Bar Plot 


The bar plot, or bar chart, is another useful plot which conveys proportions through the use of 
vertical bars. This plot has appeared earlier in the book, and we present another example of it 
here. Listing 4.28 summarizes the data from companyData.csv, and presents the total market 
capitalization of each company for each year through a stacked bar plot and a grouped bar plot. The 
results are shown in Figure 4.15. 


Listing 4.28: Two different bar plots


using CSV, CategoricalArrays, StatsPlots; pyplot()
df = CSV.read("../data/companyData.csv")

years = levels(df.Year)
data = reshape(df.MarketCap, 5, 3)

p1 = groupedbar(years, data, bar_position=:stack)
p2 = groupedbar(years, data, bar_position=:dodge)
plot(p1, p2, bar_width=0.7, fill=[:blue :red :green], label=["A" "B" "C"],
    ylims=(0,6), xlabel="Year", ylabel="Market Cap (MM)",
    legend=:topleft, size=(800, 400))











In line 5 reshape() is used to reshape the market capitalization data from a single column to a 
5 × 3 array, with the rows representing years and columns companies. In lines 7-11 the groupedbar() 
function from StatsPlots is used to create the bar plots. By setting bar_position=:stack, a 
stacked bar plot is created, while bar_position=:dodge creates a grouped bar plot instead. 
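
Because Julia arrays are stored in column-major order, reshape() fills the new array one column at a time. The sketch below uses a made-up vector in place of df.MarketCap. It assumes, as the listing implicitly does, that the rows of the data file are ordered by company and then by year.

```julia
# Hypothetical stand-in for df.MarketCap: 15 values, ordered so that the
# first 5 belong to company A, the next 5 to B, and the last 5 to C.
marketCap = collect(1:15)

data = reshape(marketCap, 5, 3)
display(data)

# Column j holds the 5 yearly values of company j.
println(data[:, 1])   # values of the first company
println(data[2, :])   # second year across the three companies
```

If the file were instead ordered by year and then by company, the same call would silently mix companies and years, so it is worth checking the ordering before reshaping.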


Figure 4.16: A stack plot showing the change in market capitalization of 
several companies over time. 


Stack Plot 


The stack plot is a commonly used plot which shows how constituent amounts of a metric 
change over time. In Listing 4.29 we present an example, where we consider the changing total 
market capitalization of the three companies A, B, and C over several years. 


Listing 4.29: A stack plot


using CSV, CategoricalArrays, Plots; pyplot()

df = CSV.read("../data/companyData.csv")
mktCap = reshape(df.MarketCap, 5, 3)
years = levels(df.Year)

areaplot(years, mktCap,
    c=[:blue :red :green], labels=["A" "B" "C"],
    xlims=(minimum(years),maximum(years)), ylims=(0,6.5),
    legend=:topleft, xlabel="Years", ylabel="MarketCap")








In line 4 the data in the MarketCap column is reshaped into a 5 × 3 array via the reshape() function. 
In line 5 levels() is used to store the unique years of the dataset in the array years in ascending 
order. In lines 7-10 areaplot() is used to create the plot, with the horizontal values given as the 
first argument, and the data to be plotted as the second argument, with rows treated as individual 
years. 





4.7 Working with Files and Remote Servers 


The ability to work with files is an important skill that the modern data scientist is expected 
to have. Often one will be required to perform various operations with files programmatically, such 
as create new files, open files, interact with their content, and save information to existing files. 
Julia provides various methods to interact with and work with files via input/output streams (I/O). 
In this section we provide two simple examples which involve working with files programmatically. 
In addition, at the end of this section, we provide a brief pseudo-code example of how one might 
connect to a remote server, query a database, and save the results to a locally stored file. 


Searching a File for Keywords 


For a first example we show how one might programmatically search a file for a specific keyword 
or content, and then save that content to a separate file. In Listing 4.30 we create a function which 
searches a text document for a given keyword, and then saves every line of text containing this 
keyword to a new text file, along with the associated line number. 


Listing 4.30: Filtering an input file


function lineSearch(inputFilename, outputFilename, keyword)
    infile = open(inputFilename, "r")
    outfile = open(outputFilename, "w")

    for (index, line) in enumerate(split(read(infile, String), "\n"))
        if occursin(keyword, line)
            println(outfile, "$index: $line")
        end
    end
    close(infile)
    close(outfile)
end

lineSearch("earth.txt", "waterLines.txt", "water")








17: 71% of Earth’s surface is covered with water, mostly by oceans. The 
19: have many lakes, rivers and other sources of water that contribute to the 





In lines 1-12 the function lineSearch() is defined, which searches an input file, inputFilename, 
for a keyword, and saves the lines and line numbers where it appears to an output file, 
outputFilename. Line 2 uses open() with "r" to open the file to be searched in read mode. 
It creates an IOStream object, which can be used as an argument to other functions. We define this as 
the variable infile. Line 3 uses open() with "w" to create and open a file in write mode, with the 
given file name outputFilename. This file is created on disk ready to have information written to 
it. Lines 5-9 contain a for loop, which is used to search through the input file for the given keyword. 
Line 5 reads the file as a String via read(), and split() is used along with "\n" to convert the 
single string into an array of strings, where the content of each line is stored in a separate index of the 
array. Line 6 uses occursin() to check if the given line contains our given keyword. If it does, 
then we proceed to line 7, where println() is used to write both the index and the line content 
to outfile. Lines 10 and 11 close both our input file and output file. In line 14 lineSearch() is 
used to search the file earth.txt for the keyword "water", with the line numbers and text saved 
to the file waterLines.txt. 
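
An alternative way to write such a function is sketched below. It is a variant of our own, not the listing's method: it reads the input one line at a time with eachline() rather than loading the whole file into a string, and it uses the do-block form of open() so that the output file is closed automatically even if an error occurs. The function name lineSearchStream and the temporary file names are hypothetical.

```julia
# Variant of lineSearch() that streams the input one line at a time.
function lineSearchStream(inputFilename, outputFilename, keyword)
    open(outputFilename, "w") do outfile
        for (index, line) in enumerate(eachline(inputFilename))
            if occursin(keyword, line)
                println(outfile, "$index: $line")
            end
        end
    end
end

# Small self-contained demonstration using a temporary directory.
mktempdir() do dir
    inFile, outFile = joinpath(dir, "in.txt"), joinpath(dir, "out.txt")
    write(inFile, "dry land\nfresh water\nsalt water\n")
    lineSearchStream(inFile, outFile, "water")
    print(read(outFile, String))
end
```

For small files the two approaches behave the same. The streaming form becomes preferable when the input is too large to hold comfortably in memory.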


Searching for Files in a Directory 


In Listing 4.31 we present our next example, where we create a function which searches a 
directory for all filenames which contain a particular string. It then saves a list of these files to a 
file, fileList.txt. Note that this function does not behave recursively and only searches the 
directory given. 


Listing 4.31: Searching files in a directory


function directorySearch(directory, searchString)
    outfile = open("../data/fileList.txt","w")
    fileList = filter(x->occursin(searchString, x), readdir(directory))

    for file in fileList
        println(outfile, file)
    end
    close(outfile)
end

directorySearch(pwd(),".jl")





In lines 1-9 we define the function directorySearch. As arguments, it takes a directory to search 
through, and a searchString. Line 2 uses open() with "w" to create our output file fileList.txt, 
which we will write to. In line 3 we create a string array of all filenames in our specified directory 
that contain our searchString. This string array is defined as the variable fileList. The 
readdir() function is used to list all files in the given directory, and filter() is used, along 
with occursin(), to check whether each element contains the searchString. Lines 5-7 loop through 
each element in fileList and print them to the output file outfile. Line 8 closes the IOStream 
outfile. Line 11 provides an example of the use of our directorySearch function, where we use 
it to obtain a shortlist of all files whose names contain ".jl" within our current working directory, 
i.e. pwd(). 
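
If subdirectories should be searched as well, one possible approach is to use walkdir() from Julia's Base library. The following is a sketch of our own and not part of Listing 4.31. The function name directorySearchRecursive is hypothetical, and for simplicity it returns the matching paths instead of writing them to a file.

```julia
# Recursive variant: returns the paths of all files under `directory`
# (including subdirectories) whose names contain `searchString`.
function directorySearchRecursive(directory, searchString)
    matches = String[]
    for (root, dirs, files) in walkdir(directory)
        for file in files
            if occursin(searchString, file)
                push!(matches, joinpath(root, file))
            end
        end
    end
    return matches
end

# Demonstration on a small temporary directory tree.
mktempdir() do dir
    mkpath(joinpath(dir, "sub"))
    touch(joinpath(dir, "a.jl"))
    touch(joinpath(dir, "sub", "b.jl"))
    touch(joinpath(dir, "notes.txt"))
    foreach(println, directorySearchRecursive(dir, ".jl"))
end
```

Returning an array rather than writing to a fixed file keeps the function easy to reuse. The caller can still save the result, for example by printing each element to an open file as in lines 5-7 of the listing.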





Connecting to a Remote Server 


One may not always work with data stored locally on their machine or network. For example, 
sometimes a dataset is too large to be stored on a workstation, and therefore must be stored remotely 
in a datacentre, or on a remote server. In this scenario one must first connect to the server before 
working with the data. A typical workflow involves connecting to the remote database, submitting 
a query, and then saving the result locally. There are different types of databases, including: Oracle, 
MySQL, PostgreSQL, MongoDB, and many others. There are several Julia packages for connecting 
to remote servers including LibPQ.jl, which is a wrapper for the PostgreSQL libpq C library, 
SQLite.jl, as well as ODBC.jl and several others. Once a connection is established, one will 
typically submit a so-called SQL query to the server. SQL stands for structured query language, 
and is a common syntax used to query remote databases in order to extract a subset of data from 
the database. 


In this section we do not expand on the details of databases, nor the syntax of SQL queries. 
Instead, in Listing 4.32 we present a simple pseudocode example of how a user may connect to a 
remote PostgreSQL database, submit a SQL query, and then save the results. 


Listing 4.32: Pseudocode for a remote database query


using LibPQ, DataFrames, CSV

host = "remoteHost"
dbname = "db1"
user = "username"
password = "userPwd"
port = "5432"    # placeholder value; 5432 is the default PostgreSQL port

conStr = "host=" * host *
    " port=" * port *
    " dbname=" * dbname *
    " user=" * user *
    " password=" * password
conn = LibPQ.Connection(conStr)

df = DataFrame(execute(conn, "SELECT * FROM S1.T1"))
close(conn)

CSV.write("example.csv", df);





In line 1 the LibPQ package is included. It is a wrapper for the libpq PostgreSQL library, and contains 
methods to remotely connect to PostgreSQL servers and submit queries. In lines 3-7 the details of the 
connection are specified and stored as strings. They include the host name, database name, username, 
password, and specific port to connect on. Lines 9-13 concatenate these details together into the string 
conStr. In line 14 a connection to the remote server is established via the Connection() function 
from the LibPQ package. The details in the string conStr are used to establish the connection. Note 
that if the password is not given in the connection string, then the server will prompt for a password. 
In line 16 a SQL query is submitted to the server via the execute() function. It takes two arguments, 
the first is the connection to the server, and the second is the SQL query. The query submitted here 
is simple: SELECT * is used to select all columns FROM the T1 table, from the S1 schema, from 
database db1. The results are stored as the DataFrame df. The connection to the server is closed 
in line 17 via close(). In line 19 the data in df is written to the CSV file example.csv, in the 
current working directory. 
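
The string handling of lines 9-13 can also be written with join(), as in the sketch below. It builds the connection string only and does not contact any server, so it runs without LibPQ. All values are placeholders and the variable names are our own.

```julia
# Placeholder connection details, in the "key=value" style of Listing 4.32.
params = ["host" => "remoteHost", "port" => "5432", "dbname" => "db1",
          "user" => "username", "password" => "userPwd"]

# Build "key=value" items and join them with single spaces.
conStr = join(["$k=$v" for (k, v) in params], " ")
println(conStr)
```

Keeping the details in one collection makes it harder to forget the separating space between items, which is an easy mistake when concatenating by hand. In practice a password would usually not be hard-coded in a script.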





Chapter 5 


Statistical Inference Concepts - DRAFT 


This chapter introduces statistical inference concepts, with the goal of establishing a theoretical 
footing of key concepts that follow in later chapters. The approach is that of classical statistics 
as opposed to machine learning, covered in Chapter 9. The action of statistical inference involves 
using mathematical techniques to make conclusions about unknown population parameters based 
on collected data. The field of statistical inference employs a variety of stochastic models to analyze 
and put forward efficient methods for carrying out such analyses. 


In broad generality, analysis and methods of statistical inference can be categorized as either 
frequentist (also known as classical) or Bayesian. The former is based on the assumption that 
population parameters of some underlying distribution, or probability law, exist and are fixed, but 
are yet unknown. The process of statistical inference then deals with making conclusions about 
these parameters based on sampled data. In the latter Bayesian case, it is only assumed that 
there is a prior distribution of the parameters. In this case, the key process deals with analyzing 
a posterior distribution (of the parameters) - an outcome of the inference process. In this book we 
focus almost solely on the classical frequentist approach with the exception of Section 5.7, where we 
explore Bayesian statistics briefly. 


In general, a statistical inference process involves data, a model, and analysis. The data is 
assumed to be comprised of random samples from the model. The goal of the analysis is then 
to make informed statements about population parameters of the model based on the data. Such 
statements typically take one of the following forms: 


Point estimation - Determination of a single value (or vector of values) representing a best estimate 
of the parameter or parameters. In this case, the notion of “best” can be defined in different ways. 


Confidence intervals - Determination of a range of values where the parameter lies. Under the 
model and the statistical process used, it is guaranteed that the parameter lies within this 
range with a pre-specified probability. 


Hypothesis tests - The process of determining if the parameter lies in a given region, in the 
complement of that region, or fails to take on a specific value. Such tests often represent a 
scientific hypothesis in a very natural way. 


Most of the point estimation, confidence intervals and hypothesis tests that we introduce and 
carry out in this book are elementary. Chapter 6 is devoted to covering elementary confidence 
intervals in detail, and Chapter 7 is devoted to covering elementary hypothesis tests in detail. We 
now begin to explore key ideas and concepts of statistical inference. 


This chapter is structured as follows: In Section 5.1 we present the concept of a random sample 
together with the distribution of statistics, such as the distribution of the sample mean and the 
sample variance. In Section 5.2 we focus on random samples of normal random variables. In this 
common case, certain statistics have well known distributions that play a central role in statistics. 
In Section 5.3 we explore the central limit theorem, providing justification for the ubiquity of the 
normal distribution. In Section 5.4 we explore basics of point estimation. In Section 5.5 we explore 
the concept of a confidence interval. In Section 5.6 we explore concepts of hypothesis testing. 
Finally, in Section 5.7 we explore the basics of Bayesian statistics. 


5.1 A Random Sample 


When carrying out (frequentist) statistical inference, we assume there is some underlying 
distribution $F(x; \theta)$ from which we are sampling, where $\theta$ is the scalar or vector-valued 
unknown parameter we wish to know. Importantly, we assume that each observation is statistically 
independent and identically distributed as the rest. That is, from a probabilistic perspective, the 
observations are taken as independent and identically distributed (i.i.d.) random variables. In 
mathematical statistics language, this is called a random sample. We denote the random variables of 
the observations by $X_1,\ldots,X_n$ and their respective values by $x_1,\ldots,x_n$. 


Typically, we compute statistics from the random sample. For example, two common standard 
statistics include the sample mean and sample variance, introduced in Section 4.2 in the context of 
data summary. However, we can model these statistics as random variables, 


$$\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad \text{and} \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \overline{X})^2. \tag{5.1}$$


Note that for $S^2$, the denominator is $n - 1$ (as opposed to $n$ as one might expect). This makes 
$S^2$ an unbiased estimator. We discuss this property further in Section 5.4. 
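
As a quick check of the formulas in (5.1), the following sketch computes both statistics by hand for a small made-up sample and compares them with mean() and var() from the Statistics standard library. The data values and variable names are our own.

```julia
using Statistics

# A small hypothetical sample.
x = [4.2, 1.7, 9.3, 0.8, 5.5]
n = length(x)

xBar = sum(x) / n                       # sample mean
s2 = sum((x .- xBar).^2) / (n - 1)      # sample variance, denominator n-1
s2Biased = sum((x .- xBar).^2) / n      # denominator n, for comparison

println("Sample mean:     ", xBar, "  vs mean(x): ", mean(x))
println("Sample variance: ", s2, "  vs var(x):  ", var(x))
println("With denominator n: ", s2Biased,
        "  vs var(x, corrected=false): ", var(x, corrected=false))
```

By default var() uses the n - 1 denominator, matching the definition of the sample variance here. The version with denominator n is obtained with corrected=false.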


In general, the phrase statistic implies a quantity calculated based on the sample. When working 
with data, the sample mean and sample variance are nothing but numbers computed from our 
sample observations. However, in the statistical inference paradigm, we associate random variables 
to these values, since they themselves are functions of the random sample. We look at properties 
of such statistics, and see how they play a role in estimating the unknown underlying distribution 
parameter $\theta$. 


To illustrate the fact that $\overline{X}$ and $S^2$ are random variables, assume we have sampled data 
from an exponential distribution with $\lambda = 4.5^{-1}$ (a mean of 4.5 and a variance of 20.25). If 
we collect $n = 10$ observations, then the sample mean and sample variance are random variables. In 
Listing 5.1 we investigate their distribution through Monte Carlo simulation and create Figure 5.1. 
The point to see is that $\overline{X}$ and $S^2$ are themselves random variables with underlying 
distributions. 
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— Histogram of Sample Means 
— Histogram of Sample Variances 
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Figure 5.1: Histograms of the sample mean and sample variance of an 
exponential distribution. 


Listing 5.1: Distributions of the sample mean and sample variance


using Random, Distributions, Plots; pyplot()
Random.seed!(0)

lambda = 1/4.5
expDist = Exponential(1/lambda)   # Exponential() takes the mean (scale), not the rate
n, N = 10, 10^6

means = Array{Float64}(undef, N)
variances = Array{Float64}(undef, N)

for i in 1:N
    data = rand(expDist,n)
    means[i] = mean(data)
    variances[i] = var(data)
end

println("Actual mean: ",mean(expDist),
    "\nMean of sample means: ",mean(means))
println("Actual variance: ",var(expDist),
    "\nMean of sample variances: ",mean(variances))

stephist(means, bins=200, c=:blue, normed=true,
    label="Histogram of Sample Means")
stephist!(variances, bins=600, c=:red, normed=true,
    label="Histogram of Sample Variances", xlims=(0,40), ylims=(0,0.4),
    xlabel = "Statistic value", ylabel = "Density")








Actual mean: 4.5 

Mean of sample means: 4.500154606762812 
Actual variance: 20.25 

Mean of sample variances: 20.237117004185237 
In lines 8-9 we initialize the arrays means and variances respectively. In lines 11-15 we create N random samples, each of length n. For each sample we calculate the sample mean and sample variance. In lines 17-20, we calculate the mean() of both arrays means and variances. It can be seen that the estimated expected values obtained from our simulated data are good approximations to the mean and variance parameters of the underlying exponential distribution. That is, for an exponential distribution with rate $\lambda$ the mean is $\lambda^{-1}$ and the variance is $\lambda^{-2}$. In lines 22-26, we generate histograms of the sample means and sample variances, using 200 and 600 bins respectively.





5.2 Sampling from a Normal Population 


It is often assumed that the distribution $F(x\,;\theta)$ is a normal distribution, and hence $\theta = (\mu, \sigma^2)$. This assumption is called the normality assumption, and is sometimes justified due to the central limit theorem, which we cover in Section 5.3. Under the normality assumption, the distributions of the random variables $\overline{X}$ and $S^2$, as well as transformations of them, are well known. The following three distributional relationships play a key role:

$$\begin{aligned}
\overline{X} &\sim \mathrm{Normal}(\mu, \sigma^2/n), \\
(n-1)S^2/\sigma^2 &\sim \chi^2_{n-1}, \\
T := \frac{\overline{X} - \mu}{S/\sqrt{n}} &\sim t_{n-1}.
\end{aligned} \qquad (5.2)$$


Here ‘$\sim$’ denotes ‘distributed as’, and implies that the statistics on the left hand side of the symbols are distributed according to the distributions on the right hand side. The notation $\chi^2_{n-1}$ and $t_{n-1}$ denotes a chi-squared and a Student T-distribution respectively, each with $n-1$ degrees of freedom. A chi-squared distribution with $k$ degrees of freedom is a gamma distribution (covered elsewhere in the book) with parameters $\lambda = 1/2$ and $\alpha = k/2$. The Student T-distribution is discussed later in this section.


Importantly, these distributional properties of the statistics from a normal sample theoretically support the statistical procedures that are presented in Chapters 6 and 7.


We now look at an example in Listing 5.2 where we sample data from a normal distribution and compute the statistics $\overline{X}$, $T$ and $S^2$. As seen in Figure 5.2, the distributions of the sample means, sample variances and T-statistics ($T$) indeed follow the distributions given by (5.2).

Figure 5.2: Histograms of the simulated sample means, sample variances, and T-statistics, against their analytic counterparts.


Listing 5.2: Friends of the normal distribution


using Distributions, Plots; pyplot()

mu, sigma = 10, 4
n, N = 10, 10^6

sMeans = Array{Float64}(undef, N)
sVars = Array{Float64}(undef, N)
tStats = Array{Float64}(undef, N)

for i in 1:N
    data = rand(Normal(mu,sigma),n)
    sampleMean = mean(data)
    sampleVars = var(data)
    sMeans[i] = sampleMean
    sVars[i] = sampleVars
    tStats[i] = (sampleMean - mu)/(sqrt(sampleVars/n))
end

xRangeMean = 5:0.1:15
xRangeVar = 0:0.1:60
xRangeTStat = -5:0.1:5

p1 = stephist(sMeans, bins=50, c=:blue, normed=true, legend=false)
p1 = plot!(xRangeMean, pdf.(Normal(mu,sigma/sqrt(n)), xRangeMean),
    c=:red, xlims=(5,15), ylims=(0,0.35), xlabel="Sample mean", ylabel="Density")

p2 = stephist(sVars, bins=50, c=:blue, normed=true, label="Simulated")
p2 = plot!(xRangeVar, (n-1)/sigma^2*pdf.(Chisq(n-1), xRangeVar*(n-1)/sigma^2),
    c=:red, label="Analytic", xlims=(0,60), ylims=(0,0.06),
    xlabel="Sample Variance", ylabel="Density")

p3 = stephist(tStats, bins=100, c=:blue, normed=true, legend=false)
p3 = plot!(xRangeTStat, pdf.(TDist(n-1), xRangeTStat),
    c=:red, xlims=(-5,5), ylims=(0,0.4), xlabel="t-statistic", ylabel="Density")

plot(p1, p2, p3, layout = (1,3), size=(1200, 400))
In line 3 we specify the parameters of the underlying normal distribution from which we sample our data. In line 4 we specify the number of observations in each sample, n, and the total number of Monte Carlo repetitions, N. In lines 6-8 we initialize three arrays which will be used to store our sample means, variances, and T-statistics. In lines 10-17, we conduct our numerical simulation by taking n sample observations from the underlying normal distribution, and calculating the sample mean, sample variance, and T-statistic. This process is repeated N times, and the values are stored in the arrays sMeans, sVars, and tStats respectively. The remainder of the code creates the histograms of the sample means, sample variances, and T-statistics alongside the analytic PDFs given by (5.2). Observe the PDF of the sample mean in line 24. Observe the PDF of a scaled chi-squared distribution through the use of the pdf() and Chisq() functions in line 28. Note that the values on the x-axis and the density are both scaled by (n-1)/sigma^2 to reflect the fact that we are interested in the PDF of a scaled chi-squared distribution. Finally, observe that the PDF of the T-statistic ($T$), which is described by a T-distribution, is plotted via the use of the TDist() function in line 33.
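The scaling used for the sample variance density follows from a change of variables. As a short derivation, writing $F_{\chi^2_{n-1}}$ and $f_{\chi^2_{n-1}}$ for the CDF and PDF of the chi-squared distribution with $n-1$ degrees of freedom (notation introduced here only for this illustration):

```latex
\begin{aligned}
P(S^2 \le s)
  &= P\Big(\frac{(n-1)S^2}{\sigma^2} \le \frac{(n-1)s}{\sigma^2}\Big)
   = F_{\chi^2_{n-1}}\Big(\frac{(n-1)s}{\sigma^2}\Big), \\
f_{S^2}(s)
  &= \frac{d}{ds}\, F_{\chi^2_{n-1}}\Big(\frac{(n-1)s}{\sigma^2}\Big)
   = \frac{n-1}{\sigma^2}\, f_{\chi^2_{n-1}}\Big(\frac{(n-1)s}{\sigma^2}\Big).
\end{aligned}
```

The factor $(n-1)/\sigma^2$ therefore appears twice: once inside the argument of the chi-squared PDF and once multiplying it.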





Independence of the Sample Mean and Sample Variance 


We now look at a key property of the sample mean and sample variance. Consider a random sample, $X_1,\ldots,X_n$. In general, one would not expect the sample mean, $\overline{X}$, and the sample variance, $S^2$, to be independent random variables, since both of these statistics rely on the same underlying values. For example, consider a random sample where $n = 2$, and let each $X_i$ be Bernoulli distributed, with parameter $p$. The joint distribution of $\overline{X}$ and $S^2$ can then be computed as follows.


If both $X_i$'s are 0, which happens with probability $(1-p)^2$, then,

$$\overline{X} = 0 \quad \text{and} \quad S^2 = 0.$$

If both $X_i$'s are 1, which happens with probability $p^2$, then,

$$\overline{X} = 1 \quad \text{and} \quad S^2 = 0.$$

If one of the $X_i$'s is 0, and the other is 1, which happens with probability $2p(1-p)$, then,

$$\overline{X} = \frac{1}{2} \quad \text{and} \quad S^2 = 1 - 2\Big(\frac{1}{2}\Big)^2 = \frac{1}{2}.$$


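Hence, as shown in Figure 5.3, the joint PMF of $\overline{X}$ and $S^2$ is,

$$P(\overline{X} = \overline{x},\, S^2 = s^2) = \begin{cases}
(1-p)^2, & \text{for } \overline{x} = 0 \text{ and } s^2 = 0, \\
2p(1-p), & \text{for } \overline{x} = 1/2 \text{ and } s^2 = 1/2, \\
p^2, & \text{for } \overline{x} = 1 \text{ and } s^2 = 0.
\end{cases} \qquad (5.3)$$

Furthermore, the (marginal) PMF of $\overline{X}$ is,

$$P_{\overline{X}}(0) = (1-p)^2, \qquad P_{\overline{X}}\Big(\frac{1}{2}\Big) = 2p(1-p), \qquad P_{\overline{X}}(1) = p^2.$$

And the (marginal) PMF of $S^2$ is,

$$P_{S^2}(0) = (1-p)^2 + p^2, \qquad P_{S^2}\Big(\frac{1}{2}\Big) = 2p(1-p), \qquad P_{S^2}(1) = 0.$$

Figure 5.3: PMF of the sample mean and sample variance.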


We now see that $\overline{X}$ and $S^2$ are not independent because the joint distribution,

$$\tilde{P}(i, j) = P_{\overline{X}}(i)\, P_{S^2}(j) \quad \text{for} \quad i, j \in \Big\{0, \frac{1}{2}, 1\Big\},$$

constructed by the product of the marginal distributions does not equal the joint distribution in (5.3).
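The comparison between the joint PMF and the product of the marginals can also be carried out by direct enumeration. The following is a minimal sketch, assuming the illustrative value p = 0.3 (not taken from the text). It enumerates the four outcomes of the two Bernoulli observations and prints both quantities at every support point.

```julia
using Statistics

p = 0.3   # assumed value of the Bernoulli parameter, for illustration only

# Enumerate the four outcomes (x1, x2) and accumulate the joint PMF of (mean, var).
joint = Dict{Tuple{Float64,Float64},Float64}()
for x1 in 0:1, x2 in 0:1
    prob = (x1 == 1 ? p : 1 - p) * (x2 == 1 ? p : 1 - p)
    key = (mean([x1, x2]), var([x1, x2]))
    joint[key] = get(joint, key, 0.0) + prob
end

# Marginal PMFs obtained by summing the joint PMF.
margMean, margVar = Dict{Float64,Float64}(), Dict{Float64,Float64}()
for ((m, v), prob) in joint
    margMean[m] = get(margMean, m, 0.0) + prob
    margVar[v]  = get(margVar, v, 0.0) + prob
end

# Compare the joint PMF with the product of the marginals at each support point.
for m in sort(collect(keys(margMean))), v in sort(collect(keys(margVar)))
    println("mean = ", m, ", var = ", v,
        ": joint = ", get(joint, (m, v), 0.0),
        ", product = ", margMean[m] * margVar[v])
end
```

If the two statistics were independent, the two printed columns would agree at every point; here they do not.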


The example above demonstrates dependence between $\overline{X}$ and $S^2$. This is in many ways unsurprising. Importantly, however, in the special case where the samples, $X_1,\ldots,X_n$, are from a normal distribution, independence between $\overline{X}$ and $S^2$ does hold. In fact, this property characterizes the normal distribution; that is, this property only holds for the normal distribution, see [L42].


We now explore this concept further in Listing 5.3. In it we compare a standard normal distribution to what we call a standard uniform distribution: a uniform distribution on $[-\sqrt{3}, \sqrt{3}]$, which exhibits zero mean and unit variance. For both distributions, we consider a random sample of size $n = 3$, and from this we obtain the pair $(\overline{X}, S^2)$. We then plot points of these pairs against points of pairs where $\overline{X}$ and $S^2$ are each obtained from two separate sample groups.


From Figure 5.4 it can be seen that for the normal distribution, regardless of whether the pair $(\overline{X}, S^2)$ is calculated from the same sample group, or from two different sample groups, the points appear to behave similarly. This is because they have the same joint distribution. However, for the standard uniform distribution, it can be observed that the points behave in a completely different manner. If the sample mean and variance are calculated from the same sample group, then all pairs of $\overline{X}$ and $S^2$ fall within a specific bounded region. The envelope of this blue region can be clearly observed, and represents the region of all possible combinations of $\overline{X}$ and $S^2$ when calculated based on the same sample data. On the other hand, if $\overline{X}$ and $S^2$ are calculated from two separate samples, then we observe a scattering of data, shown by the points in red. This difference in behavior shows that in this case $\overline{X}$ and $S^2$ are not independent, but rather the outcome of one imposes some restriction on the outcome of the other. By comparison, in the case of the standard normal distribution, regardless of how the pair $(\overline{X}, S^2)$ is calculated (from the same sample group or from two different groups), the same scattering of points is observed, supporting the fact that $\overline{X}$ and $S^2$ are independent.

Figure 5.4: Pairs of $\overline{X}$ and $S^2$ for standard uniform (left) and standard normal (right). Blue points are for statistics calculated from the same sample, and red for statistics calculated from separate samples.


Listing 5.3: Are the sample mean and variance independent?


using Distributions, Plots, LaTeXStrings; pyplot()

function statPair(dist,n)
    sample = rand(dist,n)
    [mean(sample),var(sample)]
end

stdUni = Uniform(-sqrt(3),sqrt(3))
n, N = 3, 10^5

dataUni = [statPair(stdUni,n) for _ in 1:N]
dataUniInd = [[mean(rand(stdUni,n)),var(rand(stdUni,n))] for _ in 1:N]
dataNorm = [statPair(Normal(),n) for _ in 1:N]
dataNormInd = [[mean(rand(Normal(),n)),var(rand(Normal(),n))] for _ in 1:N]

p1 = scatter(first.(dataUni), last.(dataUni),
    c=:blue, ms=1, msw=0, label="Same group")
p1 = scatter!(first.(dataUniInd), last.(dataUniInd),
    c=:red, ms=0.8, msw=0, label="Separate group",
    xlabel=L"\overline{X}", ylabel=L"S^2")

p2 = scatter(first.(dataNorm), last.(dataNorm),
    c=:blue, ms=1, msw=0, label="Same group")
p2 = scatter!(first.(dataNormInd), last.(dataNormInd),
    c=:red, ms=0.8, msw=0, label="Separate group",
    xlabel=L"\overline{X}", ylabel=L"S^2")

plot(p1, p2, ylims=(0,5), size=(800, 400))
In lines 3-6 the function statPair() is defined. It takes a distribution and an integer n as input, generates a random sample of size n, and then returns the sample mean and sample variance of this random sample as an array. In line 8 we define the standard uniform distribution, which has a mean of zero and a standard deviation of 1. In line 9 we set the number of observations for each sample, n, along with the total number of sample groups, N. In line 11, the function statPair() is used along with a comprehension to calculate N pairs of sample means and variances from N sample groups. Note that the observations are all sampled from the standard uniform distribution stdUni, and that the output is an array of arrays. In line 12 a similar approach to line 11 is used. However, in this case, rather than calculating the sample mean and variance from the same sample group each time, they are calculated from two separate sample groups, N times. As before, the data is sampled from the standard uniform distribution stdUni. Lines 13 and 14 mirror lines 11 and 12, except that the observations are sampled from a standard normal distribution, Normal().





More on the T-Distribution 


Having explored the fact that $\overline{X}$ and $S^2$ are independent in the case of a normal sample, we now elaborate on the Student T-distribution and focus on the distribution of the T-statistic that appeared earlier in (5.2). This random variable is given by:

$$T = \frac{\overline{X} - \mu}{S/\sqrt{n}}.$$

Denoting the mean and variance of the normally distributed observations by $\mu$ and $\sigma^2$ respectively, we can represent the T-statistic as,

$$T = \frac{\sqrt{n}\,(\overline{X} - \mu)/\sigma}{\sqrt{\dfrac{(n-1)S^2/\sigma^2}{n-1}}} = \frac{Z}{\sqrt{\dfrac{\chi^2_{n-1}}{n-1}}}. \qquad (5.4)$$

Here the numerator $Z$ is a standard normal random variable and in the denominator the random variable $\chi^2_{n-1} = (n-1)S^2/\sigma^2$ is chi-squared distributed with $n-1$ degrees of freedom, as claimed in (5.2). Furthermore, the numerator and denominator random variables are independent because they are based on the sample mean and sample variance respectively, which are independent for a normal sample.


One can show that the ratio of a standard normal random variable and the square root of an independent chi-squared random variable, scaled by its degrees of freedom parameter, is distributed according to a T-distribution with the same number of degrees of freedom as the chi-squared random variable. Hence $T \sim t(n-1)$, that is, $T$ follows a “T-distribution with $n-1$ degrees of freedom”. The T-distribution is a symmetric distribution with a “bell-curved” shape similar to that of the normal distribution, but with “heavier tails” for non-large $n$. A T-distribution with $k$ degrees of freedom can be shown to have the density function,

$$f(x) = \frac{\Gamma\big(\frac{k+1}{2}\big)}{\sqrt{k\pi}\;\Gamma\big(\frac{k}{2}\big)} \Big(1 + \frac{x^2}{k}\Big)^{-\frac{k+1}{2}}.$$

Note the presence of the gamma function, $\Gamma(\cdot)$, which is defined elsewhere in the book.

To gain further insight from the representation (5.4), note that $E[\chi^2_{n-1}] = n-1$ and $\mathrm{Var}(\chi^2_{n-1}) = 2(n-1)$. Thus the variance of $\chi^2_{n-1}/(n-1)$ is $2/(n-1)$, and hence one may expect that as $n \to \infty$, the random variable $\chi^2_{n-1}/(n-1)$ gets more and more concentrated around 1, with the same holding for $\sqrt{\chi^2_{n-1}/(n-1)}$. Hence for large $n$ one may expect the distribution of $T$ to be similar to the distribution of $Z$, which is indeed the case. This plays a role in the confidence intervals and hypothesis tests in the chapters that follow.
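The concentration of a chi-squared random variable divided by its degrees of freedom can be observed numerically. The following is a minimal sketch that uses only the standard library; it relies on the fact that a chi-squared random variable with k degrees of freedom can be generated as a sum of k squared independent standard normals. The helper chisq() and the chosen values of n and N are illustrative assumptions, not part of the text.

```julia
using Random, Statistics

Random.seed!(1)

# A chi-squared random variable with k degrees of freedom, generated
# as the sum of k squared independent standard normal random variables.
chisq(k) = sum(abs2, randn(k))

N = 10^4
for n in [3, 11, 101, 1001]
    k = n - 1
    scaled = [chisq(k)/k for _ in 1:N]
    println("n = ", n, ": mean = ", round(mean(scaled), digits=3),
        ", variance = ", round(var(scaled), digits=4),
        ", 2/(n-1) = ", round(2/k, digits=4))
end
```

The estimated variance should track $2/(n-1)$ and shrink as $n$ grows, while the mean stays near 1.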





In practice, when carrying out elementary statistical inference using the T-distribution (as presented in the following chapters), the most commonly used attribute is the quantile, covered elsewhere in the book. It is typically denoted with the degrees of freedom (DOF), $k$, as a subscript, where $k$ defines the specific T-distribution. Such quantiles are often tabulated in standard statistical tables.


In Listing 5.4 below we first illustrate the validity of the representation (5.4) by generating T-distributed random variables using a standard normal and a chi-squared random variable. We then plot the PDFs of several T-distributions, illustrating that as the degrees of freedom increase, the PDF converges to the standard normal PDF. See Figure 5.5.


Listing 5.4: Student’s T-distribution


using Distributions, Random, Plots; pyplot()
Random.seed!(0)

n, N, alpha = 3, 10^7, 0.1   # n and alpha agree with the printed analytic quantile; N is assumed

myT(nObs) = rand(Normal())/sqrt(rand(Chisq(nObs-1))/(nObs-1))
mcQuantile = quantile([myT(n) for _ in 1:N],alpha)
analyticQuantile = quantile(TDist(n-1),alpha)

println("Quantile from Monte Carlo: ", mcQuantile)
println("Analytic quantile: ", analyticQuantile)

xGrid = -5:0.1:5
plot(xGrid, pdf.(Normal(), xGrid), c=:black, label="Normal Distribution")
scatter!(xGrid, pdf.(TDist(1),xGrid),
    c=:blue, msw=0, label="DOF = 1")
scatter!(xGrid, pdf.(TDist(3),xGrid),
    c=:red, msw=0, label="DOF = 3")
scatter!(xGrid, pdf.(TDist(100),xGrid),
    c=:green, msw=0, label="DOF = 100",
    xlims=(-4,4), ylims=(0,0.5), xlabel="X", ylabel="Density")


Quantile from Monte Carlo: -1.8848554309670498
Analytic quantile: -1.8856180831641265





In line 6 we specify the function myT() which generates a T-distributed random variable by using a standard normal and a chi-squared random variable, just as in (5.4). In line 7 we use N replications of myT() to estimate the alpha quantile. Then in line 8 we compute the quantile analytically for a corresponding T-distribution represented by TDist(n-1). The estimated quantile and computed quantile are then printed in lines 10-11. The remainder of the code plots three T-distributions, generating Figure 5.5.

Figure 5.5: As the number of degrees of freedom (DOF) increases, the T-distribution approaches that of the normal distribution.


Two Samples and the F-Distribution 


Many statistical procedures involve the ratio of sample variances, or similar quantities, for two or more samples. For example, if $X_1,\ldots,X_{n_1}$ is one sample and $Y_1,\ldots,Y_{n_2}$ is another sample, and both samples are distributed normally with the same parameters, one can look at the ratio of the two sample variances,

$$F_{\text{statistic}} = \frac{S_X^2}{S_Y^2}.$$


It turns out that such a statistic is distributed according to what is called the F-distribution, with density given by,

$$f(x) = K(a,b)\, \frac{x^{a/2-1}}{(b + a x)^{(a+b)/2}} \qquad \text{with} \qquad K(a,b) = \frac{\Gamma\big(\frac{a+b}{2}\big)\, a^{a/2}\, b^{b/2}}{\Gamma\big(\frac{a}{2}\big)\, \Gamma\big(\frac{b}{2}\big)}.$$

Here the parameters $a$ and $b$ are the numerator degrees of freedom and denominator degrees of freedom respectively. In the case of $F_{\text{statistic}}$ we set $a = n_1 - 1$ and $b = n_2 - 1$.

In agreement with (5.2), an alternative view is that the random variable $F$ is obtained by the ratio of two independent chi-squared random variables, normalized by their degrees of freedom. The F-distribution plays a key role in the popular Analysis of Variance (ANOVA) procedures, further explored elsewhere in the book.


We now briefly explore the F-distribution in Listing 5.5 by simulating two sample sets of data with $n_1$ and $n_2$ observations respectively from a normal distribution. The ratio of the sample variances from the two samples is then compared to the PDF of an F-distribution with parameters $n_1 - 1$ and $n_2 - 1$. The listing generates Figure 5.6.

Figure 5.6: Histogram of the ratio of two sample variances against the PDF of an F-distribution.


Listing 5.5: Ratio of variances and the F-distribution


using Distributions, Plots; pyplot()

n1, n2 = 10, 15
N = 10^6
mu, sigma = 10, 4
normDist = Normal(mu,sigma)

fValues = Array{Float64}(undef, N)

for i in 1:N
    data1 = rand(normDist,n1)
    data2 = rand(normDist,n2)
    fValues[i] = var(data1)/var(data2)
end

fRange = 0:0.1:5
stephist(fValues, bins=400, c=:blue, label="Simulated", normed=true)
plot!(fRange, pdf.(FDist(n1-1, n2-1), fRange),
    c=:red, label="Analytic", xlims=(0,5), ylims=(0,0.8),
    xlabel = "F", ylabel = "Density")








In lines 3-4 we define the total number of observations for our two sample groups, n1 and n2, as well as the total number of F-statistics we will generate, N. In lines 10-14 we simulate two separate sample groups, data1 and data2, by randomly sampling from the same underlying normal distribution. A single F-statistic is then calculated from the ratio of the sample variances of these two groups. The remainder of the code creates the figure, where in line 18 the constructor FDist() is used to create an F-distribution with the parameters n1-1 and n2-1.
5.3 The Central Limit Theorem


In the previous section we assumed sampling from a normal population, and this assumption 
gave rise to a variety of properties of statistics associated with the sampling. However, why would 
such an assumption hold? A key lies in one of the most fundamental results of probability and 
statistics, the Central Limit Theorem (CLT). 


While the CLT has several versions and many generalizations, they all have one thing in common: summing a large number of random quantities, each with finite variance, yields a sum that is approximately normally distributed. This is the main reason that the normal distribution is ubiquitous in nature and present throughout the universe.


We now develop this more formally. Consider an i.i.d. sequence $X_1, X_2, \ldots$ where all $X_i$ are distributed according to some distribution $F(x_i\,;\theta)$ with mean $\mu$ and finite variance $\sigma^2$. Consider now the random variable,

$$Y_n := \sum_{i=1}^{n} X_i.$$

It is clear that $E[Y_n] = n\mu$ and $\mathrm{Var}(Y_n) = n\sigma^2$. Hence we may consider a random variable,

$$\tilde{Y}_n := \frac{Y_n - n\mu}{\sqrt{n}\,\sigma}.$$

Observe that $\tilde{Y}_n$ has zero mean and unit variance. The CLT states that as $n \to \infty$, the distribution of $\tilde{Y}_n$ converges to a standard normal distribution. That is, for every $x \in \mathbb{R}$,

$$\lim_{n \to \infty} P(\tilde{Y}_n \le x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du.$$

Alternatively, this may be viewed as indicating that for non-small $n$,

$$Y_n \underset{\text{approx}}{\sim} N(n\mu, n\sigma^2),$$

where $N$ is the normal distribution with mean $n\mu$ and variance $n\sigma^2$.


In addition, by dividing the numerator and denominator of $\tilde{Y}_n$ by $n$, we see an immediate consequence of the CLT. That is, for non-small $n$, the sample mean of $n$ observations, denoted by $\overline{X}_n$, satisfies,

$$\overline{X}_n \underset{\text{approx}}{\sim} N\Big(\mu, \Big(\frac{\sigma}{\sqrt{n}}\Big)^2\Big).$$

Hence the CLT states that sample means from i.i.d. samples with finite variances are asymptotically distributed according to a normal distribution as the sample size grows. This ubiquity of the normal distribution justifies the normality assumption employed when using many of the statistical procedures that we cover in Chapters 6 and 7.
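The standardization described above can be tried out directly. The following is a minimal sketch, assuming uniform observations on [0,1] (so the mean is 1/2 and the variance is 1/12) and illustrative values of n and N that are not taken from the text. It simulates the standardized sum $(Y_n - n\mu)/(\sqrt{n}\,\sigma)$ and compares a few summaries with those of a standard normal random variable.

```julia
using Random, Statistics

Random.seed!(1)

# Assumed example: X_i uniform on [0,1], so mu = 1/2 and sigma^2 = 1/12.
mu, sigma = 0.5, sqrt(1/12)
n, N = 30, 10^5

# N realizations of the standardized sum (Y_n - n*mu)/(sqrt(n)*sigma).
yTilde = [(sum(rand(n)) - n*mu)/(sqrt(n)*sigma) for _ in 1:N]

println("Mean:     ", mean(yTilde))
println("Variance: ", var(yTilde))
# For a standard normal Z, P(Z <= 1) is approximately 0.8413.
println("Proportion <= 1: ", mean(yTilde .<= 1))
```

The mean and variance should be close to 0 and 1, and the proportion of values not exceeding 1 should be close to the standard normal probability of about 0.8413.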


To illustrate the CLT, consider three different distributions below, noting that each has a mean and variance both equal to 1:


1. A uniform distribution on $[1-\sqrt{3}, 1+\sqrt{3}]$.


2. An exponential distribution with $\lambda = 1$.


3. A normal distribution with both a mean and variance of 1.


In Listing 5.6 we illustrate the central limit theorem by generating a histogram of N sample means for each of the three different distributions mentioned above. Although the underlying distributions are very different, i.e. uniform, exponential and normal, the sampling distributions of the sample means all approach a normal distribution centered at 1 with standard deviation $1/\sqrt{n}$. Notice that in the case of the exponential distribution, $n = 30$ isn't “enough” to get a “perfect fit” to a normal distribution.

Figure 5.7: Histograms of sample means for different underlying distributions.


Listing 5.6: The central limit theorem


using Distributions, Plots; pyplot()

n, N = 30, 10^6

dist1 = Uniform(1-sqrt(3),1+sqrt(3))
dist2 = Exponential(1)
dist3 = Normal(1,1)

data1 = [mean(rand(dist1,n)) for _ in 1:N]
data2 = [mean(rand(dist2,n)) for _ in 1:N]
data3 = [mean(rand(dist3,n)) for _ in 1:N]

stephist([data1 data2 data3], bins=100,
    c=[:blue :red :green], xlabel = "x", ylabel = "Density",
    label=["Average of Uniforms" "Average of Exponentials" "Average of Normals"],
    normed=true, xlims=(0,2), ylims=(0,2.5))


In lines 5-7 we define three different distribution type objects: a continuous uniform distribution over the domain $[1-\sqrt{3}, 1+\sqrt{3}]$, an exponential distribution with a mean of 1, and a normal distribution with mean and standard deviation both 1. In lines 9-11, we generate N sample means, each based on n observations, for each distribution defined above. In lines 13-16 we plot three separate histograms based on the sample mean vectors previously generated. With N large, the histograms closely trace the sampling distributions of the sample means. It can be observed that these are close to a normal distribution, and in addition, that each is centered at the mean of the underlying distribution from which the samples were taken.
5.4 Point Estimation


Given a random sample, $X_1,\ldots,X_n$, a common task of statistical inference is to estimate a parameter $\theta$, or a function of it, say $h(\theta)$. The process of designing an estimator, analyzing its performance, and carrying out the estimation is called point estimation.


Although we can never know the underlying parameter $\theta$, or $h(\theta)$, exactly, we can arrive at an estimate for it via an estimator $\hat{\theta} = f(X_1,\ldots,X_n)$. Here the design of the estimator is embodied by $f(\cdot)$, a function that specifies how to construct the estimate from the sample.


An important question to ask is how close $\hat{\theta}$ is to the actual unknown quantity $\theta$ or $h(\theta)$. In this section we first describe several ways of quantifying and categorizing this “closeness”, and then present two common methods for designing estimators: the method of moments and maximum likelihood estimation (MLE).


The design of (point) estimators is a central part of statistics. However, in elementary statistics courses for science students, engineers, or social studies researchers, point estimation is often not explicitly mentioned. The reason for this is that one can estimate the mean and variance via $\overline{X}$ and $S^2$ respectively, see (5.1). That is, in the case of $h(\cdot)$ being either the mean or the variance of the distribution, the estimator given by the sample mean or sample variance respectively is a natural candidate and performs exceptionally well. However, in other cases, choosing an estimation procedure is less straightforward.


Consider for example the case of a uniform distribution on the range $[0, \theta]$, and say we are interested in estimating $\theta$ based on a random sample, $X_1,\ldots,X_n$. In this case one could construct an estimator in many different ways. For example, here are a few alternative estimators:

$$\begin{aligned}
\hat{\theta}_1 &= f_1(X_1,\ldots,X_n) := \max_{i} X_i, \\
\hat{\theta}_2 &= f_2(X_1,\ldots,X_n) := 2\overline{X}, \\
\hat{\theta}_3 &= f_3(X_1,\ldots,X_n) := 2\,\mathrm{median}(X_1,\ldots,X_n), \\
\hat{\theta}_4 &= f_4(X_1,\ldots,X_n) := \sqrt{12 S^2}.
\end{aligned} \qquad (5.5)$$

Each of these makes some sense in its own right: $\hat{\theta}_1$ is based on the fact that $\theta$ is an upper bound of the observations, $\hat{\theta}_2$ and $\hat{\theta}_3$ utilize the fact that the sample mean and sample median are both expected to fall on $\theta/2$, and finally $\hat{\theta}_4$ utilizes the fact that the variance of the distribution is given by $\sigma^2 = \theta^2/12$. Given that there are various possible estimators, we require a methodology for comparing them and perhaps developing others, with the aim of choosing a suitable one. In the remainder of this section we describe some methods for analyzing the performance of such estimators and others.
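Before formalizing performance measures, the four candidate estimators can be compared empirically. The following is a minimal sketch, assuming a true value $\theta = 5$ and a sample size of n = 20; these values and all names in the code are illustrative assumptions. For each estimator it estimates the average deviation from $\theta$ and the average squared deviation from $\theta$ via Monte Carlo.

```julia
using Random, Statistics

Random.seed!(1)

# Assumed setup for illustration: true theta = 5, samples of size n = 20.
theta, n, N = 5.0, 20, 10^5

# The four candidate estimators for the upper limit of a uniform [0, theta] sample.
estimators = [
    ("maximum",      x -> maximum(x)),
    ("2 * mean",     x -> 2*mean(x)),
    ("2 * median",   x -> 2*median(x)),
    ("sqrt(12 S^2)", x -> sqrt(12*var(x)))]

# Monte Carlo estimates of the average error and average squared error.
samples = [theta*rand(n) for _ in 1:N]
for (name, f) in estimators
    est = [f(x) for x in samples]
    println(rpad(name, 14), " bias: ", round(mean(est) - theta, digits=4),
        "  MSE: ", round(mean((est .- theta).^2), digits=4))
end
```

The estimators differ noticeably in both quantities, which motivates the systematic treatment that follows.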


Describing the Performance and Behavior of Estimators 


When analyzing the performance of an estimator $\hat{\theta}$, it is important to understand that it is a random variable. One common measure of its performance is the Mean Squared Error (MSE),

$$\mathrm{MSE}_\theta(\hat{\theta}) := E[(\hat{\theta} - \theta)^2] = \mathrm{Var}(\hat{\theta}) + (E[\hat{\theta}] - \theta)^2 := \text{variance} + \text{bias}^2. \qquad (5.6)$$
The second equality arises naturally from adding and subtracting $E[\hat{\theta}]$, expanding and collecting terms. In this representation, we see that the MSE can be decomposed into the variance of the estimator and its bias squared. Low variance is clearly a desirable performance measure. The same applies to the bias, which is a measure of the expected difference between the estimator and the true parameter value. Note that in machine learning the “bias variance tradeoff” is often considered as a tradeoff between model complexity and model generalizability. The idea is similar to the decomposition in (5.6); however, it is conceptually different because the setting is different. More details are in Chapter 9.
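For completeness, the step of adding and subtracting the mean of the estimator can be written out as follows, with $\hat{\theta}$ denoting the estimator and $\theta$ the true parameter value:

```latex
\begin{aligned}
E\big[(\hat{\theta} - \theta)^2\big]
  &= E\big[\big((\hat{\theta} - E[\hat{\theta}]) + (E[\hat{\theta}] - \theta)\big)^2\big] \\
  &= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]
     + 2\,(E[\hat{\theta}] - \theta)\, E\big[\hat{\theta} - E[\hat{\theta}]\big]
     + (E[\hat{\theta}] - \theta)^2 \\
  &= \mathrm{Var}(\hat{\theta}) + (E[\hat{\theta}] - \theta)^2 .
\end{aligned}
```

The cross term vanishes because $E[\hat{\theta} - E[\hat{\theta}]] = 0$, while $E[\hat{\theta}] - \theta$ is a constant and can be taken outside the expectation.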


One question that arises with regards to estimation is: are there cases where estimators are unbiased, that is, they have a bias of 0, or alternatively $E[\hat{\theta}] = \theta$? The answer is yes. We show this now using the sample mean as a simple example.














Consider $X_1,\ldots,X_n$ distributed according to any distribution with a finite mean $\mu$. In this case, say we are interested in estimating $\mu$ (note that $\mu = h(\theta)$ for some function $h$). It is easy to see that the sample mean $\overline{X}$ is itself a random variable with mean $\mu$, and is hence unbiased. Furthermore, the variance of this estimator is $\sigma^2/n$, where $\sigma^2$ is the original variance of $X_i$. Since the estimator is unbiased, the MSE equals the variance, i.e. $\sigma^2/n$. In fact, for several standard models, such as a normal population, it can be shown that the sample mean has the smallest mean squared error among all unbiased estimators of $\mu$.


Now consider a case where the population mean $\mu$ is known, but the population variance $\sigma^2$ is unknown, and that we wish to estimate it. As a sensible estimator consider,

$$\hat{\sigma}^2 := \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2. \qquad (5.7)$$
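Computing the mean of $\hat{\sigma}^2$ yields:

$$E[\hat{\sigma}^2] = \frac{1}{n}\, E\Big[\sum_{i=1}^{n} (X_i - \mu)^2\Big] = \frac{1}{n}\sum_{i=1}^{n} E\big[(X_i - \mu)^2\big] = \frac{1}{n}\, n\sigma^2 = \sigma^2.$$

Hence $\hat{\sigma}^2$ is an unbiased estimator for $\sigma^2$. However, say we are now also interested in estimating the (population) standard deviation, $\sigma$. In this case it is natural to use the estimator,

$$\hat{\sigma} := \sqrt{\hat{\sigma}^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2}.$$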


Interestingly, while this is a perfectly sensible estimator, it is not unbiased. We illustrate this via simulation in Listing 5.7. In it we consider a uniform distribution over $[0,1]$, where the population mean, variance and standard deviation are 0.5, 1/12 and $\sqrt{1/12}$ respectively. We then estimate the bias of $\hat{\sigma}^2$ and $\hat{\sigma}$ via Monte Carlo simulation. The output shows that $\hat{\sigma}$ is not unbiased. However, as the numerical results illustrate, it is asymptotically unbiased. That is, the bias tends to 0 as the sample size $n$ grows.
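The direction of this bias can be anticipated from Jensen's inequality, since the square root is a concave function. As a short supporting step, with $\hat{\sigma}^2$ denoting the unbiased variance estimator for the known-mean case and $\hat{\sigma}$ its square root:

```latex
E[\hat{\sigma}] = E\Big[\sqrt{\hat{\sigma}^2}\Big]
  \;\le\; \sqrt{E[\hat{\sigma}^2]} = \sqrt{\sigma^2} = \sigma .
```

The inequality is strict whenever $\hat{\sigma}^2$ is not almost surely constant, so the estimator of the standard deviation is expected to underestimate $\sigma$ on average, in agreement with a negative estimated bias.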
Listing 5.7: A biased estimator


using Random, Statistics
Random.seed!(0)

trueVar, trueStd = 1/12, sqrt(1/12)

function estVar(n)
    sample = rand(n)
    sum((sample .- 0.5).^2)/n
end

N = 10^7
for n in 5:5:30
    biasVar = mean([estVar(n) for _ in 1:N]) - trueVar
    biasStd = mean([sqrt(estVar(n)) for _ in 1:N]) - trueStd
    println("n = ",n, " Var bias: ", round(biasVar, digits=5),
        "\t Std bias: ", round(biasStd, digits=5))
end





n = 5 Var bias: 1.0e-5 Std bias: -0.00642 
n = 10 Var bias: 1.0e-5 Std bias: -0.00303 
n = 15 Var bias: 0.0 Std bias: -0.00199 
n = 20 Var bias: -1.0e-5 Std bias: -0.00148 
n = 25 Var bias: -1.0e-5 Std bias: -0.00117 
n = 30 Var bias: 0.0 Std bias: -0.00098 





In lines 6-8 the function estVar() is defined, which implements (5.7). In lines 12-17, we loop over sample sizes n = 5, 10, 15, ..., 30, and for each we repeat N sampling experiments, from which we estimate the biases of $\hat{\sigma}^2$ and $\hat{\sigma}$ respectively. The estimated biases are stored in biasVar and biasStd and then printed.





Having explored an estimator for $\sigma^2$ with $\mu$ known, as well as briefly touching on an estimator for $\sigma$ for the same case, we now ask the question: What would be a sensible estimator for $\sigma^2$ in the more realistic case where $\mu$ is not known? A natural first suggestion would be to replace $\mu$ in (5.7) with $\overline{X}$ to obtain,

$$\tilde{S}^2 := \frac{1}{n}\sum_{i=1}^{n} (X_i - \overline{X})^2.$$

With a few lines of computations involving expectations, one can verify that,

$$E[\tilde{S}^2] = \frac{n-1}{n}\,\sigma^2.$$

Hence it is biased, albeit asymptotically unbiased. This is the reason that the preferred estimator, $S^2$, is actually,

$$S^2 = \frac{n}{n-1}\,\tilde{S}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \overline{X})^2,$$

as in (5.1). This yields an unbiased estimator.
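The expectation computation behind this bias can be sketched as follows, writing $\tilde{S}^2$ for the version of the sample variance with denominator $n$ (a symbol used here for illustration) and using $\mathrm{Var}(\overline{X}) = \sigma^2/n$:

```latex
\begin{aligned}
\sum_{i=1}^{n} (X_i - \overline{X})^2
  &= \sum_{i=1}^{n} (X_i - \mu)^2 - n(\overline{X} - \mu)^2, \\
E\Big[\sum_{i=1}^{n} (X_i - \overline{X})^2\Big]
  &= n\sigma^2 - n\,\mathrm{Var}(\overline{X}) = n\sigma^2 - \sigma^2 = (n-1)\,\sigma^2, \\
E[\tilde{S}^2]
  &= \frac{1}{n}\,(n-1)\,\sigma^2 = \frac{n-1}{n}\,\sigma^2 .
\end{aligned}
```

Dividing the sum of squares by $n-1$ instead of $n$ therefore removes the bias.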
There are other important qualitative properties of estimators that one may explore. One such property is consistency. Roughly, we say that an estimator is consistent if it converges to the true value as the number of observations grows to infinity. More can be found in mathematical statistics references such as [DS11] and [CB01]. The remainder of this section presents two common methodologies for estimating parameters: method of moments and maximum likelihood estimation. A comparison of these two methodologies is presented.


Method of Moments 


The method of moments is a methodological way to obtain parameter estimates for a distribution.
The key idea is based on moment estimators for the k'th moment, $E[X^k]$,

$$\hat{m}_k = \frac{1}{n}\sum_{i=1}^n X_i^k. \tag{5.8}$$


As a simple example, consider a uniform distribution on $[0, \theta]$. An estimator for the first
moment (k = 1) is then $\hat{m}_1 = \overline{X}$. Now we denote by X a typical random variable from
this sample. For such a distribution $E[X^1] = \theta/2$. We can then equate the moment estimator
with the first moment expression to arrive at the equation,

$$\frac{\theta}{2} = \hat{m}_1.$$

Notice that this equation involves the unknown parameter $\theta$ and the moment estimator obtained
from the data. Solving for $\theta$ then yields the estimator,

$$\hat{\theta} = 2\hat{m}_1,$$

which is exactly $\hat{\theta}_2$ from (5.5).
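
The uniform example can be sketched in a few lines of Julia. This is an illustration only: the true value theta = 3.0 and the sample size are hypothetical choices made for the sketch, not values used elsewhere in the chapter.

```julia
using Random, Statistics

Random.seed!(0)

theta = 3.0                      # hypothetical true parameter, chosen for illustration
n = 1000                         # hypothetical sample size
sample = theta .* rand(n)        # observations from a uniform distribution on [0, theta]

mHat1 = mean(sample)             # first moment estimator, (5.8) with k = 1
thetaHat = 2 * mHat1             # solve theta/2 = mHat1 for theta

println("Method of moments estimate of theta: ", thetaHat)
```

The estimate should land close to the chosen theta, with the discrepancy shrinking as n grows.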


In cases where there are multiple unknown parameters, say K, we use the first K moment
estimates to formulate a system of K equations with K unknowns. This system of equations can be
written as,

$$E[X^k \,;\, \theta_1,\ldots,\theta_K] = \hat{m}_k, \qquad \text{for} \quad k = 1,\ldots,K. \tag{5.9}$$

For many textbook examples (such as the uniform distribution case described above), we are able
to solve this system of equations analytically, yielding a solution,

$$\hat{\theta}_k = g_k(\hat{m}_1,\ldots,\hat{m}_K), \qquad \text{for} \quad k = 1,\ldots,K. \tag{5.10}$$

Here the functions $g_k(\cdot)$ describe the solution of the system of equations. However, it is
often not possible to obtain explicit expressions for $g_k(\cdot)$. In these cases numerical
techniques are typically used to solve the corresponding system of equations.


As an example, consider the triangular distribution with density,

$$f(x) = \begin{cases} \dfrac{2(x-a)}{(b-a)(c-a)}, & x \in [a,c), \\[2mm] \dfrac{2(b-x)}{(b-a)(b-c)}, & x \in [c,b]. \end{cases}$$


This distribution has support $[a, b]$, and a maximum at c with $a \le c \le b$ and $a < b$. Note
that the Julia triangular distribution function uses this same parameterization:
TriangularDist(a,b,c). Now straightforward (yet tedious) computation yields the first three
moments, $E[X^1]$, $E[X^2]$, $E[X^3]$, as well as the system of equations for the method of moments:

$$\begin{aligned}
\hat{m}_1 &= \frac{1}{3}(a+b+c), \\
\hat{m}_2 &= \frac{1}{6}(a^2+b^2+c^2+ab+ac+bc), \\
\hat{m}_3 &= \frac{1}{10}(a^3+b^3+c^3+a^2b+a^2c+b^2a+b^2c+c^2a+c^2b+abc).
\end{aligned} \tag{5.11}$$


Generally, this system of equations is not analytically solvable. Hence, the method of moments
estimator is given by a numerical solution to (5.11). In Listing 5.8, given a series of observations,
we numerically solve this system of equations through the use of the NLsolve package, and arrive
at estimates for the values of a, b and c. Observe that the equations are symmetric in a, b and c,
in the sense that permuting these values does not change the equations. Hence, when using a
numerical solver, there is a possibility that it will return an arbitrary permutation of the
solution. We remedy this by sorting the solution and picking estimators for a, b and c according
to the sorted order.


Listing 5.8: Point estimation via the method of moments using a numerical solver


using Random, Distributions, NLsolve
Random.seed!(0)

a, b, c = 3, 5, 4
dist = TriangularDist(a,b,c)
n = 2000
samples = rand(dist,n)

m_k(k,data) = 1/n*sum(data.^k)
mHats = [m_k(i,samples) for i in 1:3]

function equations(F, x)
    F[1] = 1/3*( x[1] + x[2] + x[3] ) - mHats[1]
    F[2] = 1/6*( x[1]^2 + x[2]^2 + x[3]^2 + x[1]*x[2] + x[1]*x[3] +
                x[2]*x[3] ) - mHats[2]
    F[3] = 1/10*( x[1]^3 + x[2]^3 + x[3]^3 + x[1]^2*x[2] + x[1]^2*x[3] +
                x[2]^2*x[1] + x[2]^2*x[3] + x[3]^2*x[1] + x[3]^2*x[2] +
                x[1]*x[2]*x[3] ) - mHats[3]
end

nlOutput = nlsolve(equations, [0.1; 0.1; 0.1])
sol = sort(nlOutput.zero)
aHat, bHat, cHat = sol[1], sol[3], sol[2]
println("Found estimates for (a,b,c) = ", (aHat, bHat, cHat), "\n")
println(nlOutput)








Found estimates for (a,b,c) = (3.002706152232, 5.003033254712, 3.999191608726) 


Results of Nonlinear Solver Algorithm 
* Algorithm: Trust-region with dogleg and autoscaling 
* Starting Point: [0.1, 0.1, 0.1] 
* Zero: [5.00303, 3.99919, 3.00271] 
* Inf-norm of residuals: 0.000000 
* Iterations: 14 
* Convergence: true
  * |x - x'| < 0.0e+00: false
  * |f(x)| < 1.0e-08: true
* Function Calls (f): 15
* Jacobian Calls (df/dx): 13





In line 1 we specify the use of the NLsolve package. This package contains numerical methods for
solving non-linear systems of equations. In lines 4-7, we specify the parameters of the triangular
distribution and the distribution itself, dist. We also specify the total number of samples n, and
generate our sample set of observations, samples. In line 9, the function m_k() is defined, which
implements (5.8), and in line 10 this function is used to estimate the first three moments, given
our observations samples. In lines 12-19, we set up the system of simultaneous equations within
the function equations(). This specific format is used as it is a requirement of the nlsolve()
function which is used later. The equations() function takes two arrays as input, F and x. The
elements of F represent the left hand sides of the equations (which are later solved for zero),
and the elements of x represent the unknowns of the equations, here a, b and c. Note that in setting
up the equations from (5.11), the moment estimators are moved to the right hand side, so that the
zeros can be found. In line 21, the nlsolve() function from the NLsolve package is used to solve for
the zeros of the function equations(), given a starting point of [0.1; 0.1; 0.1]. In this example,
since the Jacobian was not specified, it is computed by finite differences. In lines 22-23 we sort
the solution, obtained via the zero field of the nlsolve() output, and set the estimates of the
parameters based on the sorted order. In line 24 the estimates are printed, and in line 25 the
complete output from nlsolve() is printed.





Maximum Likelihood Estimation (MLE) 


Maximum likelihood estimation is another commonly used technique for creating point estimators.
In fact, in the study of mathematical statistics, it is probably the most popular method used.
The key principle is to consider the likelihood of the parameter $\theta$ having a specific value
given observations $x_1,\ldots,x_n$. That is, we ask what the most likely parameter value is, based
on the observations. This is done via the likelihood function, which is presented below for the
i.i.d. case of continuous probability distributions,

$$L(\theta \,;\, x_1,\ldots,x_n) = f_{X_1,\ldots,X_n}(x_1,\ldots,x_n \,;\, \theta) = \prod_{i=1}^n f(x_i \,;\, \theta). \tag{5.12}$$

In the second equality, the joint probability density of $X_1,\ldots,X_n$ is represented as the
product of the individual probability densities, since the observations are assumed i.i.d.


A key observation is that the likelihood, $L(\cdot)$, in (5.12) is a function of the parameter
$\theta$, influenced by the sample, $x_1,\ldots,x_n$. Now given the likelihood, the maximum
likelihood estimator is a value $\hat{\theta}$ that maximizes $L(\theta \,;\, x_1,\ldots,x_n)$. The
rationale behind using this as an estimator is that it chooses the parameter value $\theta$ that is
most plausible, given the observed sample.


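As an example, consider the continuous uniform distribution on $[0,\theta]$. In this case, it is
useful to consider the PDF for an individual observation as,

$$f(x \,;\, \theta) = \frac{1}{\theta}\,\mathbf{1}\{x \in [0,\theta]\}, \qquad \text{for} \quad x \in \mathbb{R}.$$

Here the indicator function $\mathbf{1}\{\cdot\}$ explicitly constrains the support of the random
variable to $[0,\theta]$. Now using (5.12) it follows that,

$$L(\theta \,;\, x_1,\ldots,x_n) = \frac{1}{\theta^n}\prod_{i=1}^n \mathbf{1}\{x_i \in [0,\theta]\} = \frac{1}{\theta^n}\,\mathbf{1}\{0 \le \min_i x_i\}\,\mathbf{1}\{\max_i x_i \le \theta\}.$$

From this we see that for any sample $x_1,\ldots,x_n$ with non-negative values, this function (of
$\theta$) is maximized at $\hat{\theta} = \max_i x_i$, since $1/\theta^n$ is decreasing in $\theta$
and the likelihood is zero for $\theta < \max_i x_i$. Hence the MLE for this case is exactly
$\hat{\theta}_1$ from (5.5).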
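
The shape of this likelihood can be explored numerically by evaluating it on a grid of candidate values. The sketch below is illustrative only: the true parameter, the sample size, the grid and the helper name likelihood() are all assumptions made for the example.

```julia
using Random

Random.seed!(0)

theta = 2.0                       # hypothetical true parameter
n = 20                            # hypothetical sample size
sample = theta .* rand(n)

# Likelihood of a uniform [0, t] sample: 1/t^n if all observations lie in [0, t], else 0
likelihood(t, data) = all(x -> 0 <= x <= t, data) ? 1/t^length(data) : 0.0

thetaGrid = 0.01:0.01:4
likVals = [likelihood(t, sample) for t in thetaGrid]
thetaHatGrid = thetaGrid[argmax(likVals)]    # first grid point at which L is largest

println("Sample maximum:       ", maximum(sample))
println("Grid maximizer of L:  ", thetaHatGrid)
```

The grid maximizer is the first grid point at or above the sample maximum, in agreement with the analytic MLE up to the grid resolution.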


Many textbooks present constructed examples of MLEs, where the likelihood is a differentiable
function of $\theta$. In such cases, these MLEs can be solved explicitly, by carrying out the
optimization of the likelihood function analytically (for example, see [DS11]). However, this is
not always possible, and often numerical optimization of the likelihood function is carried out
instead.


As an example, consider the case where we have n random samples from what we know to be a
gamma distribution, with PDF,

$$f(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)}\,x^{\alpha-1}e^{-\lambda x},$$

and parameters $\lambda > 0$ and $\alpha > 0$. In such a case, where $\lambda$ and $\alpha$ are both
unknown, there is no explicit solution to the MLE optimization problem, and hence we resort to
numerical methods instead. In Listing 5.9 we construct a plot of the likelihood function. That is,
given synthetic data, we calculate the likelihood function for various combinations of $\alpha$ and
$\lambda$. Note that directly after this example, we present an elegant approach for this numerical
problem.


Listing 5.9: The likelihood function for a gamma distribution's parameters


using Random, Distributions, Plots, LaTeXStrings; pyplot()
Random.seed!(0)

actualAlpha, actualLambda = 2,3
gammaDist = Gamma(actualAlpha,1/actualLambda)
n = 10^2
sample = rand(gammaDist, n)

alphaGrid = 1:0.02:3
lambdaGrid = 2:0.02:5

likelihood = [prod([pdf.(Gamma(a,1/l),v) for v in sample])
                for l in lambdaGrid, a in alphaGrid]

surface(alphaGrid, lambdaGrid, likelihood, lw=0.1,
    c=cgrad([:blue, :red]), legend=:none, camera = (135,20),
    xlabel=L"\alpha", ylabel=L"\lambda", zlabel="Likelihood")








Figure 5.8: Likelihood function for different combinations of $\alpha$ and $\lambda$ for a
gamma distribution.





In lines 4-5 we specify the parameters $\alpha$ and $\lambda$, as well as the underlying
distribution, gammaDist. Note that the gamma distribution in Julia, Gamma(), uses a different
parameterization to the one used here (i.e. Gamma() uses $\alpha$ and $1/\lambda$). In lines 6-7,
we generate n sample observations, sample. In lines 9-10 we specify the grid of values over which
we will calculate the likelihood function, based on various combinations of $\alpha$ and
$\lambda$. In lines 12-13 we evaluate the likelihood function through the use of the prod()
function on an array of all PDF values, evaluated for each sample observation, v. Through the use
of a two-way comprehension, this process is repeated for all possible combinations of a and l in
alphaGrid and lambdaGrid respectively. This results in a 2-dimensional array of evaluated
likelihood values for various combinations of $\alpha$ and $\lambda$, denoted likelihood.
Lines 15-17 create the plot.








The likelihood function plotted in Figure 5.8 embodies the data. An MLE is then the maximizer
of the likelihood. We now investigate this optimization problem further, and in the process present
further insight. First observe that any maximizer, $\hat{\theta}$, of
$L(\theta \,;\, x_1,\ldots,x_n)$ will also maximize its logarithm. Practically, both from an
analytic and a numerical perspective, considering this log-likelihood function is often more
attractive:

$$\ell(\theta \,;\, x_1,\ldots,x_n) := \log L(\theta \,;\, x_1,\ldots,x_n) = \sum_{i=1}^n \log\big(f(x_i \,;\, \theta)\big).$$

Hence, given a sample from a gamma distribution as before, the log-likelihood function is,

$$\ell(\theta \,;\, x_1,\ldots,x_n) = n\alpha\log(\lambda) - n\log(\Gamma(\alpha)) + (\alpha-1)\sum_{i=1}^n \log(x_i) - \lambda\sum_{i=1}^n x_i.$$

We may then divide by n (without compromising the optimizer) to obtain the following function
that needs to be maximized:

$$\tilde{\ell}(\theta \,;\, \overline{x}, \overline{x}_\ell) = \alpha\log(\lambda) - \log(\Gamma(\alpha)) + (\alpha-1)\overline{x}_\ell - \lambda\overline{x},$$

where $\overline{x}$ is the sample mean and,

$$\overline{x}_\ell := \frac{1}{n}\sum_{i=1}^n \log(x_i).$$


Figure 5.9: Repetitions of MLE for a gamma(2, 3) distribution with 
n = 10,100, 1000. For n = 100 and n = 1000, asymptotic normality is visible. 


Further simplification is possible by removing the stand-alone $-\overline{x}_\ell$ term, as it
does not affect the optimal value. Hence our optimization problem is then,

$$\max_{\lambda>0,\,\alpha>0} \ \alpha\big(\log(\lambda) + \overline{x}_\ell\big) - \log(\Gamma(\alpha)) - \lambda\overline{x}. \tag{5.13}$$

As is typical in such cases, the function actually depends on the sample only through the two
sufficient statistics $\overline{x}$ and $\overline{x}_\ell$. Now in optimizing (5.13) we aren't
able to obtain an explicit expression for the maximizer. However, taking $\alpha$ as fixed, we may
consider the derivative with respect to $\lambda$, and equate this to 0:

$$\frac{\alpha}{\lambda} - \overline{x} = 0.$$

Hence, for any optimal $\alpha^*$, we have that $\lambda^* = \alpha^*/\overline{x}$. This allows us
to substitute $\lambda^*$ for $\lambda$ in (5.13) to obtain:

$$\max_{\alpha>0} \ \alpha\big(\log(\alpha) - \log(\overline{x}) + \overline{x}_\ell\big) - \log(\Gamma(\alpha)) - \alpha. \tag{5.14}$$

Now by taking the derivative of (5.14) with respect to $\alpha$, and equating this to 0, we obtain,

$$\log(\alpha) + 1 - \log(\overline{x}) + \overline{x}_\ell - \psi(\alpha) - 1 = 0,$$

where $\psi(z) := \frac{d}{dz}\log(\Gamma(z))$ is the well known digamma function. Hence we find
that $\alpha^*$ must satisfy:

$$\log(\alpha) - \psi(\alpha) - \log(\overline{x}) + \overline{x}_\ell = 0. \tag{5.15}$$

In addition, since $\lambda^* = \alpha^*/\overline{x}$, our optimal MLE solution is given by
$(\alpha^*, \lambda^*)$. In order to find this value, (5.15) must be solved numerically.
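
As a brief aside on the numerical attraction of the log-likelihood, the following sketch shows what can happen when a likelihood is evaluated directly as a product over a moderately large sample. It is an illustration under stated assumptions: a standard normal sample and a hand-written density normalPdf() are used purely for convenience (they are not part of the gamma example), and n = 2000 is an arbitrary choice.

```julia
using Random

Random.seed!(0)

n = 2000                                   # hypothetical sample size
data = randn(n)                            # standard normal sample, used only for illustration

normalPdf(x) = exp(-x^2/2)/sqrt(2*pi)      # density of the standard normal distribution

likelihood    = prod(normalPdf.(data))     # product of many numbers below 1
logLikelihood = sum(log.(normalPdf.(data)))

println("Likelihood:     ", likelihood)
println("Log-likelihood: ", logLikelihood)
```

Since every density value here is below 0.4, the product of 2000 of them is far smaller than the smallest positive double precision number and is returned as zero, whereas the sum of logarithms remains a finite number that can still be optimized.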


In Listing 5.10 we do just this. In fact, we repeat the act of numerically solving (5.15) many
times, and in the process illustrate the distribution of the MLE in terms of $\lambda$ and
$\alpha$. Note that there are many more properties of the MLE that we do not discuss here,
including the asymptotic distribution of the MLE, which happens to be multivariate normal. However,
through this example, we provide an intuitive illustration of the distribution of the MLE, which is
bivariate in this case, and can be observed in Figure 5.9.


Listing 5.10: MLE for the gamma distribution


using SpecialFunctions, Distributions, Roots, Plots, LaTeXStrings; pyplot()

eq(alpha, xb, xbl) = log(alpha) - digamma(alpha) - log(xb) + xbl

actualAlpha, actualLambda = 2, 3
gammaDist = Gamma(actualAlpha, 1/actualLambda)

function mle(sample)
    alpha = find_zero( (a)->eq(a,mean(sample),mean(log.(sample))), 1)
    lambda = alpha/mean(sample)
    return [alpha, lambda]
end

N = 10^4

mles10 = [mle(rand(gammaDist,10)) for _ in 1:N]
mles100 = [mle(rand(gammaDist,100)) for _ in 1:N]
mles1000 = [mle(rand(gammaDist,1000)) for _ in 1:N]

scatter(first.(mles10), last.(mles10),
    c=:blue, ms=1, msw=0, label="n = 10")
scatter!(first.(mles100), last.(mles100),
    c=:red, ms=1, msw=0, label="n = 100")
scatter!(first.(mles1000), last.(mles1000),
    c=:green, ms=1, msw=0, label="n = 1000",
    xlims=(0,6), ylims=(0,8), xlabel=L"\alpha", ylabel=L"\lambda")

















In line 1, we specify usage of the SpecialFunctions and Roots packages, as they contain the
digamma() and find_zero() functions respectively. In line 3, the eq() function implements
equation (5.15). Note that it takes three arguments: an alpha value alpha, a sample mean xb, and
the mean of the logarithms of the observations xbl. In lines 5-6 we specify the actual parameters
of the underlying gamma distribution, as well as the distribution itself. In lines 8-12 the
function mle() is defined, which takes an array of sample observations, and in line 9 solves for
the value of alpha which satisfies the zero of eq(). This is done through the use of the
find_zero() function, and the anonymous function (a)->eq(a,mean(sample),mean(log.(sample))), in
which the logarithm is applied element wise via log.(). Note the trailing 1 in line 9, which is
used as the initial value of the iterative solver. In line 10, the corresponding lambda value is
calculated, and in line 11 both alpha and lambda are returned as an array of values. In line 16,
10 random samples are drawn from our gamma distribution, and then the function mle() is used to
solve for the corresponding values of alpha and lambda. This experiment is repeated through a
comprehension N times in total, and the resulting array of arrays is stored as mles10. Lines 17-18
repeat the same procedure as that in line 16, however in these two cases, the experiments are
conducted for 100 and 1000 random samples respectively. In lines 20-26, a scatterplot of the
resulting pairs of $\hat{\alpha}$ and $\hat{\lambda}$ is created, for the cases of the sample size
being equal to 10, 100 and 1000. Note the use of the first() and last() functions, which are used
to return the values of alpha and lambda respectively. Note that the bivariate distribution of
alpha and lambda can be observed. In addition, for a larger number of observations, it can be seen
that the estimates are centered about the true underlying parameter values of alpha and lambda,
2 and 3. This agrees with the fact that the MLE is asymptotically unbiased.


Figure 5.10: Comparing the method of moments and MLE in terms of
MSE, variance, and bias.


Comparing the Method of Moments and MLE 


We now carry out an illustrative comparison between a method of moments estimator and
an MLE on a specific example. Consider a random sample $x_1,\ldots,x_n$ from a uniform
distribution on the interval $(a,b)$. The MLE for the parameter $\theta = (a,b)$ can be shown to be,

$$\hat{a} = \min\{x_1,\ldots,x_n\}, \qquad \hat{b} = \max\{x_1,\ldots,x_n\}. \tag{5.16}$$

For the method of moments estimator, since $X \sim \text{uniform}(a,b)$, it follows that,

$$E[X] = \frac{a+b}{2}, \qquad \text{Var}(X) = \frac{(b-a)^2}{12}.$$

Hence, by solving for a and b, and replacing $E[X]$ and $\text{Var}(X)$ with $\overline{x}$ and
$s^2$ respectively, we obtain,

$$\hat{a} = \overline{x} - \sqrt{3}\,s, \qquad \hat{b} = \overline{x} + \sqrt{3}\,s. \tag{5.17}$$


Observe that here we are actually using the second central moment (variance) as opposed to the 
second moment to construct the estimator. This is a slight variation on the method of moments 
method described above and yields a nicer expression. 
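
For completeness, the step of solving the mean and variance expressions of the uniform distribution for a and b can be written out as follows. This is a routine rearrangement, using only $b > a$:

```latex
\begin{aligned}
a + b &= 2\,E[X], \qquad b - a = \sqrt{12\,\mathrm{Var}(X)} = 2\sqrt{3}\,\sqrt{\mathrm{Var}(X)}, \\
a &= \tfrac{1}{2}\big((a+b) - (b-a)\big) = E[X] - \sqrt{3}\,\sqrt{\mathrm{Var}(X)}, \\
b &= \tfrac{1}{2}\big((a+b) + (b-a)\big) = E[X] + \sqrt{3}\,\sqrt{\mathrm{Var}(X)}.
\end{aligned}
```

Replacing $E[X]$ by $\overline{x}$ and $\sqrt{\mathrm{Var}(X)}$ by $s$ gives the two estimators.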


Now we can compare how the estimators (5.16) and (5.17) perform based on MSE, and specifically
its variance and bias components. In Listing 5.11 we use Monte Carlo simulation to compare the
estimates of b using both the method of moments and MLE, for different values of n. The code
creates Figure 5.10, analyzing MSE, bias and variance. As can be seen, the MSE of maximum
likelihood is lower than the MSE of the method of moments, and this is due to the variance of
maximum likelihood being lower. However, maximum likelihood exhibits more significant bias than
the method of moments. Nevertheless, observe that after squaring, the bias contribution to the MSE
is not as significant as the variance. The reader should keep in mind that these conclusions about
MSE, variance and bias are specific to this example. However, there is more supporting theory for
the usefulness of maximum likelihood estimation as $n \to \infty$. See for example [CB01].


Listing 5.11: MSE, bias and variance of estimators


using Distributions, Plots; pyplot()

N = 10^4
nMin, nStep, nMax = 10, 10, 200
nn = Int(nMax/nStep)
sampleSizes = nMin:nStep:nMax
trueB = 5
trueDist = Uniform(-2, trueB)

MLEest(data) = maximum(data)
MMest(data) = mean(data) + sqrt(3)*std(data)

res = Dict{Symbol,Array{Float64}}(
    ((sym) -> sym => Array{Float64}(undef,nn)).(
        [:MSeMLE, :MSeMM, :VarMLE, :VarMM, :BiasMLE, :BiasMM]))

for (i, n) in enumerate(sampleSizes)
    mleEst, mmEst = Array{Float64}(undef, N), Array{Float64}(undef, N)
    for j in 1:N
        sample = rand(trueDist,n)
        mleEst[j] = MLEest(sample)
        mmEst[j] = MMest(sample)
    end
    meanMLE, meanMM = mean(mleEst), mean(mmEst)
    varMLE, varMM = var(mleEst), var(mmEst)

    res[:MSeMLE][i] = varMLE + (meanMLE - trueB)^2
    res[:MSeMM][i] = varMM + (meanMM - trueB)^2
    res[:VarMLE][i] = varMLE
    res[:VarMM][i] = varMM
    res[:BiasMLE][i] = meanMLE - trueB
    res[:BiasMM][i] = meanMM - trueB
end

p1 = scatter(sampleSizes, [res[:MSeMLE] res[:MSeMM]], c=[:blue :red],
    label=["Mean sq.err (MLE)" "Mean sq.err (MM)"])
p2 = scatter(sampleSizes, [res[:VarMLE] res[:VarMM]], c=[:blue :red],
    label=["Variance (MLE)" "Variance (MM)"])
p3 = scatter(sampleSizes, [res[:BiasMLE] res[:BiasMM]], c=[:blue :red],
    label=["Bias (MLE)" "Bias (MM)"])

plot(p1, p2, p3, ms=10, shape=:xcross, xlabel="n",
    layout=(1,3), size=(1200, 400))


In line 4 the minimum, maximum and step size for sample size observations are specified. These are
used to define the number of sample size groups nn. In lines 7 and 8 the true parameter, trueB,
and the underlying distribution, trueDist, are specified. Lines 10 and 11 specify the two
estimators in the functions MLEest() and MMest(). Lines 13-15 create a dictionary mapping symbols
(type Symbol) to arrays (type Array{Float64}). The dictionary is initialized with symbol keys
:MSeMLE,...,:BiasMM, and with values that are empty arrays. The main simulation loop is in
lines 17-33 where we use enumerate to loop over tuples (i,n), with i the index of the iteration and
n a value from the range sampleSizes. In each iteration we initialize empty arrays for parameter
estimates in line 18. We then repeat the experiment N times in the loop of lines 19-23.
Lines 24-32 record performance measures in the dictionary res.


5.5 Confidence Interval as a Concept


Now that we have dealt with the concept of a point estimator, we consider how confident we are 
about our estimate. The previous section included analysis of such confidence in terms of the mean 
squared error and its variance and bias components. However, given a single sample, X1,..., Xn, 
how does one obtain an indication about the accuracy of the estimate? Here the concept of a 
confidence interval comes as an aid. 


Consider the case where we are trying to estimate the parameter $\theta$. A confidence interval is
then an interval $[L, U]$ obtained from our sample data, such that,

$$P(L \le \theta \le U) = 1 - \alpha, \tag{5.18}$$

where $1 - \alpha$ is called the confidence level. Knowing this range $[L, U]$ in addition to
$\hat{\theta}$ is useful, as it indicates some level of certainty in regards to the unknown value.
Much of elementary classical statistics involves explicit formulas for L and U, based on the sample
$X_1,\ldots,X_n$. Most of Chapter 6 is dedicated to this, however in this section we simply
introduce the concept through an elementary non-standard example.


Consider a case of a single observation (n = 1) taken from a symmetric triangular distribution,
with a spread of 2 and an unknown center (mean) $\mu$. In this case, we would set,

$$L = X + q_{\alpha/2}, \qquad U = X + q_{1-\alpha/2},$$

where $q_u$ is the u'th quantile of a triangular distribution centered at 0, and having a spread
of 2. Setting L and U in this manner ensures that (5.18) holds. Note that this is not the only
possible construction of a confidence interval, however it makes sense due to the symmetry of the
problem. For such a triangular distribution, calculating quantiles using integration or areas of
triangles, it holds that $q_{\alpha/2} = -1 + \sqrt{\alpha}$ and $q_{1-\alpha/2} = 1 - \sqrt{\alpha}$.
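
The quantile expressions can be verified with a short computation. Taking the triangular density centered at 0 with a spread of 2 to be $f(x) = 1 - |x|$ on $[-1, 1]$, the area to the left of a point $q \le 0$ is that of a triangle with base $1 + q$ and height $1 + q$:

```latex
\begin{aligned}
F(q) &= \int_{-1}^{q} (1 + x)\,dx = \frac{(1+q)^2}{2}, \qquad -1 \le q \le 0, \\
F(q_{\alpha/2}) &= \frac{\alpha}{2}
  \;\Longrightarrow\; (1 + q_{\alpha/2})^2 = \alpha
  \;\Longrightarrow\; q_{\alpha/2} = -1 + \sqrt{\alpha}, \\
q_{1-\alpha/2} &= -\,q_{\alpha/2} = 1 - \sqrt{\alpha} \qquad \text{(by symmetry about 0)}.
\end{aligned}
```

For $\alpha = 0.05$ this gives $q_{1-\alpha/2} = 1 - \sqrt{0.05} \approx 0.776$, so the resulting interval has a total width of about 1.55.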


Now, given observations (a single observation in this case), we can compute L and U. A
demonstration of this is performed in Listing 5.12 below.


Listing 5.12: A confidence interval for a symmetric triangular distribution


using Random, Distributions
Random.seed!(0)

alpha = 0.05
L(obs) = obs - (1-sqrt(alpha))
U(obs) = obs + (1-sqrt(alpha))

mu = 5.57
observation = rand(TriangularDist(mu-1,mu+1,mu))
println("Lower bound L: ", L(observation))
println("Upper bound U: ", U(observation))





Lower bound L: 5.1997170907797585
Upper bound U: 6.7525034952798


In lines 5-6, the functions L() and U() implement the formulas above. In this simple example, the
actual (unknown) parameter value $\mu$ is set in line 8. Then the sample, a single observation in
this case, is obtained in line 9. The virtue of the example is in presenting the 95% confidence
interval, as output by lines 10 and 11. Based on the output (after rounding), we can state with 95%
confidence that the unknown parameter lies in the range [5.2, 6.75].





Let us now further explore the meaning of a confidence interval by considering (5.18). The key
point is that there is a $1 - \alpha$ chance that the interval $[L, U]$ covers the actual parameter
$\theta$. This means that if the sampling experiment is repeated, say N times, then on average the
actual parameter $\theta$ is covered by the interval $(1 - \alpha) \times 100\%$ of the time, that
is, in about $N(1 - \alpha)$ of the repetitions.


In Listing 5.13 we present an example where we repeat the previous sampling process N = 100
times. Each time we take a single sample (a single observation in this case) and construct the
corresponding confidence interval. We observe that in about $\alpha \times N = 5$ of the
repetitions the confidence interval, $[L, U]$, does not include the parameter in question, $\mu$.
The results are presented in Figure 5.11.


Listing 5.13: Repetitions of a confidence interval


using Random, Distributions, StatsPlots; pyplot()
Random.seed!(2)

alpha = 0.05
L(obs) = obs - (1-sqrt(alpha))
U(obs) = obs + (1-sqrt(alpha))

mu = 5.57
triDist = TriangularDist(mu-1,mu+1,mu)

N = 100
hitBounds, missBounds = zeros(N,2), zeros(N,2)
for i in 1:N
    observation = rand(triDist)
    LL, UU = L(observation), U(observation)
    if LL <= mu && mu <= UU
        hitBounds[i,:] = [LL UU-LL]
    else
        missBounds[i,:] = [LL UU-LL]
    end
end

groupedbar(hitBounds, bar_position=:stack,
    c=:blue, la=0, fa=[0 1], label="", ylims=(3,8))
groupedbar!(missBounds, bar_position=:stack,
    c=:red, la=0, fa=[0 1], label="", ylims=(3,8))
plot!([0,N+1], [mu,mu],
    c=:black, xlims=(0,N+1),
    ylims=(3,8), label="Parameter value", ylabel="Value Estimate")














At the heart of this example we repeat the experiment N = 100 times and create the matrices
hitBounds and missBounds. These are plotted via the groupedbar() function from the StatsPlots
package. The main loop in lines 13-21 records a confidence interval as a "hit" in line 17 or
alternatively as a "miss" in line 19.


Figure 5.11: 100 confidence intervals. The blue confidence interval bars
contain the unknown parameter, while the red ones do not.


5.6 Hypothesis Tests Concepts 


Having explored point estimation and confidence intervals, we now consider ideas associated
with hypothesis testing. The approach involves partitioning the parameter space $\Theta$ into
$\Theta_0$ and $\Theta_1$, and then, based on the sample, concluding whether one of two hypotheses,
$H_0$ or $H_1$, holds. Here,

$$H_0: \theta \in \Theta_0, \qquad H_1: \theta \in \Theta_1. \tag{5.19}$$

The hypothesis $H_0$ is called the null hypothesis and $H_1$ the alternative hypothesis. The former
is the default hypothesis, and in carrying out hypothesis testing our general aim (or hope) is to
reject this hypothesis. This is because in typical situations we are wishing to demonstrate that the
alternative hypothesis holds, as opposed to some well established status quo captured by the null
hypothesis.


























                                          Decision
                           Do not reject $H_0$         Reject $H_0$
Reality   $H_0$ is true    Correct ($1-\alpha$)        Type I error ($\alpha$)
                           "true negative"             "false positive"
          $H_0$ is false   Type II error ($\beta$)     Correct ($1-\beta$)
                           "false negative"            "true positive"


Table 5.1: Type I and Type II errors with their probabilities $\alpha$ and $\beta$ respectively.


Since our decision is based on a random sample, there is always a chance of reaching a mistaken
conclusion. As summarized in Table 5.1, the two types of errors that can be made are a type I
error: rejecting $H_0$ falsely, sometimes called a "false positive", or a type II error: failing to
correctly reject $H_0$, sometimes called a "false negative". The probability $\alpha$ quantifies the
likelihood of making a type I error, while the probability of making a type II error is denoted by
$\beta$. Note that $1 - \beta$ is known as the power of the hypothesis test, and this concept of
power is covered in more detail in a later section. Note that in carrying out an hypothesis test,
$\alpha$ is typically specified, while power is not directly controlled, but rather is influenced
by the sample size and other factors.


Figure 5.12: The distribution of the test statistic, $X^*$, under $H_0$. With
$\alpha = 0.05$ the rejection region is to the left of the black dashed line. In a
specific sample, the test statistic is on the red line and we reject $H_0$.


An important point in terminology is that we don't use the phrase "accept" for the null
hypothesis, rather we "fail to reject it" (if we stick with $H_0$) or "reject it" (if we choose
$H_1$). This is because when we fail to reject $H_0$, we typically don't know the actual value of
$\theta$, hence we aren't able to put a level of certainty on $H_0$ being the case. However if we do
reject $H_0$, then by the design of hypothesis tests we can say that our error probability is
bounded by $\alpha$.


We now present some elementary examples which illustrate the basic concepts involved. Standard
hypothesis tests are discussed in depth in Chapter 7.


The Test Statistic, Rejection Region and p-Values 


In general, the key objects in hypothesis testing are the test statistic, the rejection region and
p-values. Once the scientific question is formulated as an hypothesis by partitioning the parameter
space according to (5.19), the next step is to calculate the test statistic. For this we define the
test statistic, denoted $X^*$, as a function of the data. Examples include the sample mean, the
sample variance, or other statistics. Importantly, with probabilistic assumptions on the sample
data, the test statistic is a random variable itself.


Since the test statistic follows some distribution under $H_0$, the next step is to consider how
likely it is to observe the specific value calculated from our sample data. To this end, in setting
up the hypothesis test we typically choose a significance level $\alpha$, at 0.05, 0.01, or a
similar value. It quantifies our level of tolerance for enduring a type I error. For example setting
$\alpha = 0.01$ implies we wish to design a test where the probability of a type I error is at most
0.01 if $H_0$ holds. Clearly a low $\alpha$ is desirable, however there are tradeoffs involved since
seeking a very low $\alpha$ will imply a high $\beta$ (low power).


With the test statistic and $\alpha$ at hand, we are able to determine the rejection region which
we denote by R. It is a subset of the real line where $P(X^* \in R) \le \alpha$, under $H_0$. The
idea is then to calculate the test statistic $X^*$ and reject $H_0$ if $X^* \in R$, and otherwise
not to reject $H_0$. Typically R is selected at one or both extremes of the support, depending on
the distribution of the test statistic and the hypothesis (5.19).


To illustrate these concepts, we now present a simple yet non-standard example. Consider that
we have a series of sample observations distributed as continuous uniform between 0 and some
unknown upper bound, m. Say that we set,

$$H_0: m = 1, \qquad H_1: m < 1.$$

With observations $X_1,\ldots,X_n$, one possible test statistic is the sample range:

$$X^* = \max(X_1,\ldots,X_n) - \min(X_1,\ldots,X_n).$$

As is always the case, the test statistic is a random variable. Under $H_0$ we expect the
distribution of $X^*$ to have support [0, 1] with the most likely value being close to 1. This is
because low values of $X^*$ are less plausible under $H_0$, since we can expect the minimum to be
near 0 and the maximum to be near 1. The explicit form of the distribution of $X^*$ can be
analytically obtained, however for simplicity we use a Monte Carlo simulation to estimate it and
present the density in Figure 5.12 for n = 10 observations.


For this case, it is sensible to reject $H_0$ if $X^*$ is small. Hence, denoting quantiles of this
distribution by $q_0(u)$, we set the rejection region as $R = [0, q_0(\alpha)]$. Using Monte Carlo,
we also compute the rejection region and present it in the figure, where the critical value is the
upper boundary, $q_0(\alpha)$, of the rejection region. Note that computing the rejection region
does not require any sample data as it is based on model assumptions and not the sample. Still, it
is computed via Monte Carlo in this specific example. The decision rule for this hypothesis test is
simple: compare the observed value of the test statistic, $x^*$, to the critical value
$q_0(\alpha)$ and reject $H_0$ if $x^* < q_0(\alpha)$, otherwise do not reject.


An alternative view of hypothesis tests is to consider the p-value. Here we collect the data and
compute the observed value of the test statistic, $x^*$. The p-value is then the smallest $\alpha$ for which
$H_0$ would be rejected with the observed test statistic. In other words, we find $p$ which solves
$x^* = q_0(p)$. This is computed via $F_0(x^*)$, where $F_0(\cdot)$ is the CDF of $X^*$ under $H_0$.


Using the p-value approach, reporting a low p-value (e.g. $p = 0.0024$) implies that we are very
confident in rejecting $H_0$, while a high p-value (e.g. $p = 0.24$) implies we are not. The p-value approach
can also be used to decide whether $H_0$ should be rejected or not with a specified $\alpha$. For this, simply
compare $p$ and $\alpha$, and reject $H_0$ if $p \le \alpha$.
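For the uniform range example, the critical value and the p-value can also be cross-checked without simulation. The sketch below assumes the standard result that the range of $n$ independent uniform(0,1) observations has CDF $F_0(r) = n r^{n-1} - (n-1) r^n$ on $[0, 1]$; this formula is not derived in the text. The names F0, q0 and xStar are illustrative, and xStar = 0.517 is simply an example observed value of the test statistic.

```julia
# Closed-form null distribution of the sample range for n uniform(0,1) draws.
n, alpha = 10, 0.05

# CDF of the range under H0 (standard result, assumed here and not derived in the text)
F0(r) = n * r^(n - 1) - (n - 1) * r^n

# Quantile by bisection, valid because F0 is increasing on [0, 1]
function q0(u; tol = 1e-10)
    lo, hi = 0.0, 1.0
    while hi - lo > tol
        mid = (lo + hi) / 2
        if F0(mid) < u
            lo = mid
        else
            hi = mid
        end
    end
    return (lo + hi) / 2
end

criticalValue = q0(alpha)       # upper boundary of the rejection region [0, q0(alpha)]
xStar = 0.517                   # illustrative observed range
pValue = F0(xStar)              # p-value via the CDF of the test statistic

println("Critical value: ", round(criticalValue, digits = 4))
println("p-value:        ", round(pValue, digits = 4))
println(pValue <= alpha ? "Reject H0" : "Do not reject H0")
```

Comparing pValue with alpha and comparing xStar with criticalValue are two forms of the same decision rule, so they always agree.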


Listing 5.14 creates Figure 5.12 and illustrates the operation of the hypothesis test. In this case,
we illustrate a scenario where the unknown parameter $m$ is mActual = 0.75. With the specific
seed selected, it turns out that $x^* = 0.517$. This corresponds to a p-value of 0.0141, so $H_0$ is rejected
for $\alpha = 0.05$ but would not be rejected for $\alpha = 0.01$. Keep in mind that the N repetitions in
this example serve only to obtain the distribution of $X^*$ under $H_0$ and the critical value, 0.6058.
In the many standard hypothesis tests presented in Chapter 7, the distribution of the test statistic
is analytically available, so such Monte Carlo based computation is not needed.

Notice that with this specific sample (which depends on the seed), lines 18-19 of the code are not executed.
However, if you were to change the seed in line 2, this would simulate a scenario with different data
points, and it is possible not to reject $H_0$ even though $H_1$ holds (mActual < 1).


Listing 5.14: The distribution of a test statistic under $H_0$

using Distributions, Random, Statistics, Plots; pyplot()
Random.seed!(2)

n, N, alpha = 10, 10^7, 0.05
mActual = 0.75
dist0, dist1 = Uniform(0,1), Uniform(0,mActual)

ts(sample) = maximum(sample) - minimum(sample)    # test statistic: the sample range

empiricalDistUnderH0 = [ts(rand(dist0,n)) for _ in 1:N]
rejectionValue = quantile(empiricalDistUnderH0, alpha)

sample = rand(dist1,n)
testStat = ts(sample)
pValue = sum(empiricalDistUnderH0 .<= testStat)/N

if testStat > rejectionValue
    print("Didn't reject: ", round(testStat, digits=4))
    print(" > ", round(rejectionValue, digits=4))
else
    print("Reject: ", round(testStat, digits=4))
    print(" <= ", round(rejectionValue, digits=4))
end
println("\np-value = $(round(pValue, digits=4))")

stephist(empiricalDistUnderH0, bins=100, c=:blue, normed=true, label="")
plot!([testStat, testStat], [0,4], c=:red, label="Observed test statistic")
plot!([rejectionValue, rejectionValue], [0,4], c=:black, ls=:dash,
    label="Critical value boundary", legend=:topleft, ylims=(0,4),
    xlabel = "x", ylabel = "Density")








Reject: 0.517 <= 0.6058 
p-value = 0.0141 





In line 8 we define the function ts() which calculates the test statistic from a sample. We use it in
lines 10-11 to obtain N (many) samples under $H_0$ and calculate the rejectionValue, $q_0(\alpha)$. The
actual testing procedure begins in line 13 when we collect our sample, simulating a point in $H_1$ (since
mActual = 0.75). The test statistic is calculated in line 14 and the p-value in line 15. The decision
rule is then executed in lines 17-23 and the p-value is also printed. The remainder of the code
creates Figure 5.12.

Figure 5.13: Type I (blue) and Type II (green) errors. The rejection region,
from $\tau = 17.5$ to the right, is colored red on the horizontal axis.


Simple Hypothesis Tests 


When the parameter spaces $\Theta_0$ and $\Theta_1$ each comprise only a single point,
the hypothesis test is called a simple hypothesis test. Such a test is often not of great practical
use, but we introduce it here for pedagogical purposes. Specifically, by analyzing such tests we can
understand how type I and type II errors interplay.


As an introductory example, consider a container holding two types of pipes that are identical
except that one type weighs 15 grams on average and the other 18 grams on average. The standard
deviation of the weights of both pipe types is 2 grams. Imagine now that we sample a single pipe,
and wish to determine its type. Denote the weight of this pipe by the random variable $X$. For this
example we devise the following statistical hypothesis test: $\Theta_0 = \{15\}$ and $\Theta_1 = \{18\}$. Now, given
a threshold $\tau$, we reject $H_0$ if $X > \tau$, otherwise we retain $H_0$.


In this circumstance, we can explicitly analyze the probabilities of both the type I and type
II errors, $\alpha$ and $\beta$ respectively. Listing 5.15 generates Figure 5.13, which illustrates this
graphically for $\tau = 17.5$. You may try to modify the value of tau in the code to see how the
probabilities for type I and type II errors vary.

Listing 5.15: A simple hypothesis test

using Distributions, StatsBase, Plots, LaTeXStrings; pyplot()

mu0, mu1, sd, tau = 15, 18, 2, 17.5
dist0, dist1 = Normal(mu0,sd), Normal(mu1,sd)
grid = 5:0.1:25
h0grid, h1grid = tau:0.1:25, 5:0.1:tau

println("Probability of Type I error: ", ccdf(dist0,tau))
println("Probability of Type II error: ", cdf(dist1,tau))

plot(grid, pdf.(dist0,grid),
    c=:blue, label="Bolt type 15g")
plot!(h0grid, pdf.(dist0,h0grid),
    c=:blue, fa=0.2, fillrange=[0 1], label="")
plot!(grid, pdf.(dist1,grid),
    c=:green, label="Bolt type 18g")
plot!(h1grid, pdf.(dist1,h1grid),
    c=:green, fa=0.2, fillrange=[0 1], label="")
plot!([tau, 25], [0,0],
    c=:red, lw=3, label="Rejection region",
    xlims=(5, 25), ylims=(0,0.25), legend=:topleft,
    xlabel="x", ylabel="Density")
annotate!([(16, 0.02, text(L"\beta")), (18.5, 0.02, text(L"\alpha")),
    (15, 0.21, text(L"H_0")), (18, 0.21, text(L"H_1"))])




















Probability of Type I error: 0.10564977366685525 
Probability of Type II error: 0.4012936743170763 








In line 3 we set the parameters of the example. In line 4 we define the distributions under $H_0$ and
$H_1$. Lines 5-6 set grids of values that are used for plotting and for shading the type I and type II
error ranges. In lines 8-9 we compute $\alpha$ and $\beta$ using ccdf() and cdf() on dist0 and dist1 respectively.
The remainder of the code creates the figure using the pdf() function. Notice the calls to plot!() in
lines 13-14 and 17-18 using the fillrange argument.
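The two printed probabilities can also be checked by hand. Under the normality assumption made in the listing, and writing $\Phi$ for the standard normal CDF (notation introduced here for the check), standardizing the threshold $\tau = 17.5$ gives:

```latex
\begin{aligned}
\alpha &= P(X > \tau \mid H_0) = 1 - \Phi\!\left(\frac{17.5 - 15}{2}\right) = 1 - \Phi(1.25) \approx 0.1056, \\
\beta  &= P(X \le \tau \mid H_1) = \Phi\!\left(\frac{17.5 - 18}{2}\right) = \Phi(-0.25) \approx 0.4013.
\end{aligned}
```

Both values agree with the output of the listing to four decimal places.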





The Receiver Operating Curve 


In the previous example, $\tau = 17.5$ was arbitrarily chosen. Clearly, if $\tau$ was increased, the
probability of making a type I error, $\alpha$, would decrease, while the probability of making a type II error,
$\beta$, would increase. Conversely, if we decreased $\tau$ the reverse would occur. We now introduce the
Receiver Operating Curve (ROC), also sometimes called the receiver operating characteristic curve.
It is a tool that helps to visualize the tradeoff between type I and type II errors. It allows one
to visualize the error tradeoffs for all possible $\tau$ values simultaneously, for a particular alternative
hypothesis $H_1$.


We look at three different scenarios for $\mu_1$: 16, 18, and 20. Clearly, the bigger the difference
between $\mu_0$ and $\mu_1$, the easier it should be to make a decision without errors. In Listing 5.16 we
consider each scenario and shift $\tau$, and in the process plot the analytic coordinates of $(\alpha(\tau), 1 - \beta(\tau))$.
This is the ROC. It is a parametric plot of the probability of a type I error and power. The results
are in Figure 5.14. ROCs are also a way of comparing different sets of hypotheses simultaneously.
By plotting several different ROCs on the same figure, we can compare the likelihood of making
errors for various scenarios of different $\mu_1$'s.


To better understand how the ROCs are generated, consider also Figure 5.13 and imagine the
effect of sliding $\tau$. In this figure, the shaded blue area represents $\alpha$, while 1 minus the green
area represents power. If one considers $\tau = 25$, then both $\alpha$ and power are almost zero, and this
corresponds (approximately) to (0,0) in Figure 5.14. Now, as the $\tau$ threshold is slowly decreased,
it can be seen that the power increases at a much faster rate than $\alpha$, and this behavior is observed
in the ROC. In addition, as the difference in means between the null and alternative hypotheses
becomes greater, the ROC curves are pushed "further out" from the diagonal dashed line,
reflecting the fact that such alternative hypotheses are easier to detect.
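The shape of these curves can be made explicit. As a sketch, assume normal distributions with common standard deviation $\sigma$ under both hypotheses, as in the pipe example, and write $\Phi$ for the standard normal CDF (notation introduced here). Eliminating $\tau$ from the two coordinates gives the power as a function of $\alpha$:

```latex
\begin{aligned}
\alpha(\tau) &= 1 - \Phi\!\left(\frac{\tau - \mu_0}{\sigma}\right), \qquad
1 - \beta(\tau) = 1 - \Phi\!\left(\frac{\tau - \mu_1}{\sigma}\right), \\
\tau &= \mu_0 + \sigma\,\Phi^{-1}(1 - \alpha)
\quad\Longrightarrow\quad
1 - \beta = 1 - \Phi\!\left(\Phi^{-1}(1 - \alpha) - \frac{\mu_1 - \mu_0}{\sigma}\right).
\end{aligned}
```

When $\mu_1 = \mu_0$ the expression reduces to $1 - \beta = \alpha$, which is the diagonal line, and the curve moves further above the diagonal as $(\mu_1 - \mu_0)/\sigma$ grows.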


Figure 5.14: Three ROCs for various points within $H_1$.


using Distributions, StatsBase, Plots, LaTeXStrings; pyplot()

mu0, mu1a, mu1b, mu1c, sd = 15, 16, 18, 20, 2
tauGrid = 5:0.1:25

dist0 = Normal(mu0,sd)
dist1a, dist1b, dist1c = Normal(mu1a,sd), Normal(mu1b,sd), Normal(mu1c,sd)

falsePositive = ccdf.(dist0,tauGrid)
truePositiveA, truePositiveB, truePositiveC =
    ccdf.(dist1a,tauGrid), ccdf.(dist1b,tauGrid), ccdf.(dist1c,tauGrid)

plot(falsePositive, [truePositiveA truePositiveB truePositiveC],
    c=[:blue :red :green],
    label=[L"H_1: \mu_1 = 16" L"H_1: \mu_1 = 18" L"H_1: \mu_1 = 20"])
plot!([0,1], [0,1], c=:black, ls=:dash, label="H0 = H1 = 15",
    xlims=(0,1), ylims=(0,1), xlabel=L"\alpha", ylabel="Power",
    ratio=:equal, legend=:bottomright)

The range tauGrid holds the values of $\tau$ that are used. The distribution dist0 is
for $H_0$ and the distributions dist1a, dist1b and dist1c are for three variants of $H_1$. The plots are
then plots of falsePositive vs. truePositiveA, truePositiveB or truePositiveC. The
essence of the plotting code is to use falsePositive for the horizontal coordinate and
truePositiveA, truePositiveB or truePositiveC for the vertical coordinate.
This creates a parametric plot. Lines 16-18 plot a diagonal dashed line. This line represents the
extreme case of the distributions of $H_0$ and $H_1$ directly overlapping. In this case, the probability of a
type I error is the same as the power.





A Randomized Hypothesis Test 


We now investigate the concept of a randomization test, which is a type of non-parametric test,
i.e. a statistical test which does not require that we know what type of distribution the data comes
from. A virtue of non-parametric tests is that they do not impose a specific model. Consider the
following example, where a farmer wants to test whether a new fertilizer is effective at increasing the
yield of her tomato plants. As an experiment, she took 20 plants, kept 10 as controls and treated
the remaining 10 with fertilizer. After two months, she harvested the plants, and recorded the yield
of each plant (in kg) as shown in Table 5.2.





Control    | 4.17  5.58  5.18  6.11  4.5   4.61  5.17  4.53  5.33  5.14
Fertilizer | 6.31  5.12  5.54  5.5   5.37  5.29  4.92  6.15  5.8   5.26

















Table 5.2: Yield in kg for 10 plants with, and 10 plants without fertilizer (control). 


It can be observed that the group of plants treated with fertilizer has an average yield 0.494 kg
greater than that of the control group. One could argue that this difference is due to the effects of
the fertilizer. We now investigate if this is a reasonable assumption. Let us assume for a moment
that the fertilizer had no effect on plant yield ($H_0$), and that the result was simply due to random
chance. In such a scenario, we actually have 20 observations from the same group, and regardless
of how we arrange our observations we would expect to observe similar results.


Hence we can investigate the likelihood of this outcome occurring by random chance, by consid- 
ering all possible combinations of 10 samples from our group of 20 observations, and counting how 
many of these combinations result in a difference in sample means greater than or equal to 0.494 kg. 
The proportion of times this occurs is analogous to the likelihood that the difference we observe in 
our sample means was purely due to random chance. It is in a sense the p-value. 


Before proceeding we calculate the number of ways one can sample $r = 10$ unique items from
$n = 20$ total, which is given by,

$$\binom{20}{10} = 184{,}756.$$


Hence the number of possible combinations in our example is computationally manageable. Note
that in a different situation where $n$ and $r$ would be bigger, e.g. $n = 40$ and $r = 20$, the number of
combinations would be too big for an exhaustive search (about 137 billion). In such a case, a viable
alternative is to randomly sample combinations for estimating the p-value.

In Listing 5.17 we use Julia's Combinatorics package to enumerate the difference in sample
means for every possible combination. From the output we observe that only 2.39% of all possible
combinations result in a sample mean greater than or equal to that of our treated group, i.e. a difference
greater than or equal to 0.494 kg. Therefore there is significant statistical evidence that the fertilizer
increases the yield of the tomato plants, since under $H_0$ there is only a 2.39% chance of obtaining
this value or greater by random chance.
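The random sampling alternative mentioned for larger problems can be sketched as follows. Instead of enumerating all combinations, we repeatedly shuffle the pooled observations and treat the first 10 as a randomly chosen "fertilizer" group. The function name sampledPValue, the number of repetitions N, and the seed are illustrative assumptions, and the data are typed in from Table 5.2 rather than read from a file.

```julia
using Random, Statistics

control    = [4.17, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14]
fertilizer = [6.31, 5.12, 5.54, 5.5, 5.37, 5.29, 4.92, 6.15, 5.8, 5.26]

# Sizes of the exhaustive searches discussed in the text
println("C(20,10) = ", binomial(20, 10))
println("C(40,20) = ", binomial(40, 20))

function sampledPValue(control, fertilizer; N = 10^5, rng = MersenneTwister(0))
    pooled = [control; fertilizer]
    r = length(fertilizer)
    observed = sum(fertilizer)       # comparing sums is equivalent to comparing means
    hits = 0
    for _ in 1:N
        shuffle!(rng, pooled)
        # The first r entries of a random permutation form a uniformly random subset.
        # The small tolerance guards against floating point ties.
        if sum(view(pooled, 1:r)) >= observed - 1e-9
            hits += 1
        end
    end
    return hits / N
end

pEst = sampledPValue(control, fertilizer)
println("Estimated p-value = ", pEst)
```

With N = 10^5 repetitions, the estimate should typically land within a few tenths of a percentage point of the value obtained by exhaustive enumeration.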


Listing 5.17: A randomized hypothesis test

using Combinatorics, Statistics, DataFrames, CSV

data = CSV.read("../data/fertilizer.csv", DataFrame)
control = data.Control
fertilizer = data.FertilizerX

subGroups = collect(combinations([control;fertilizer],10))

meanFert = mean(fertilizer)
pVal = sum([mean(i) >= meanFert for i in subGroups])/length(subGroups)
println("p-value = ", pVal)





p-value = 0.023972157873086666 





We use the Combinatorics package for the combinations() function. In lines 3-5 we import our
data, and store the data for the control and fertilized groups in the arrays control and fertilizer.
In line 7 all observations are concatenated into one array via the use of [ ; ]. Following this, the
combinations() function is used to generate an iterator object for all combinations of 10 elements
from our 20 observations. The collect() function then converts this iterator into an array of all
possible combinations of 10 objects, sampled from 20 total. This array of all combinations is stored
as subGroups. In line 9, the mean of the fertilizer group is calculated and assigned to the variable
meanFert. In line 10 the mean of each combination in the array subGroups is calculated and compared against
meanFert. The proportion of means which are greater than or equal to meanFert is then calculated
through the use of a comprehension, and the functions sum() and length().





5.7 A Taste of Bayesian Statistics


In this section we briefly explore the Bayesian approach to statistical inference as an alternative
to the frequentist view of statistics, which was introduced in earlier sections and is used
throughout the remainder of the book. In the Bayesian paradigm, the (scalar or vector) parameter
$\theta$ is not assumed to exist as some fixed unknown quantity, but instead is assumed to follow a
distribution. That is, the parameter itself is a random variable, and the act of Bayesian inference
is the process of obtaining more information about the distribution of $\theta$. Such a setup is useful
in many practical situations since it allows one to incorporate prior beliefs about the parameter,
before experience from new observations is taken into consideration. It also allows one to carry out
repeated inference in a very natural manner by allowing inference in future periods to rely on past
experience or past data.


The key objects at play are the prior distribution of the parameter and the posterior distribution
of the parameter. The former is postulated beforehand, or exists as a consequence of previous
inference, while the latter captures the distribution of the parameter after observations are taken
into account. The relationship between the prior and the posterior is then,

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} \qquad \text{or} \qquad f(\theta \mid x) = \frac{f(x \mid \theta) \times f(\theta)}{\int f(x \mid \theta) f(\theta)\, d\theta}. \tag{5.20}$$








This is nothing but Bayes' rule applied to densities. Here the prior distribution (density) is
$f(\theta)$ and the posterior distribution (density) is $f(\theta \mid x)$. Observe that the denominator, known
as the evidence or marginal likelihood, is constant with respect to the parameter $\theta$. This allows the
equation to be written as,

$$f(\theta \mid x) \propto f(x \mid \theta) \times f(\theta), \tag{5.21}$$

where the symbol "$\propto$" denotes "proportional to". Hence the posterior distribution can be easily
obtained up to the normalizing constant (the evidence) by multiplying the prior with the
likelihood, $f(x \mid \theta)$.


In general, carrying out Bayesian inference involves the following steps: 


1. Assume some distributional model for the data, with parameters $\theta$.


2. Use previous inference experience, elicit an expert, or make an educated guess to determine a
prior distribution for the parameter, $f(\theta)$. The prior distribution might be parameterized by
its own parameters, called hyperparameters.


3. Collect data $x$, and obtain an expression or a computational mechanism for the likelihood $f(x \mid \theta)$
based on the distributional model chosen.


4. Use the relationship (5.20) to obtain the posterior distribution of the parameters, $f(\theta \mid x)$. In
most cases, the evidence (the denominator of (5.20)) is not easily computable. Hence the posterior
distribution is only available up to a normalizing constant. In some special cases the form
of the posterior distribution is the same as the prior distribution. In such cases, conjugacy
holds, the prior is called a conjugate prior, and the hyperparameters are updated from prior
to posterior.


5. The posterior distribution can then be used to make conclusions about the model. For
example, if a single specific parameter value is needed to make the model concrete, a Bayes
estimate based on the posterior distribution, such as the posterior mean, may be
computed:

$$\hat{\theta} = \int \theta\, f(\theta \mid x)\, d\theta. \tag{5.22}$$

Further analyses such as obtaining credible intervals, similar to confidence intervals, may also
be carried out. See a brief discussion in Section


6. The model with $\hat{\theta}$ can then be used for making conclusions. Alternatively, a whole class of
models based on the posterior distribution $f(\theta \mid x)$ can be used. This often goes hand in hand
with simulation as one is able to generate Monte Carlo samples from the posterior distribution.




Bayesian inference has gained significant popularity over the past few decades and has evolved
together with the whole field of computational statistics. Unless conjugacy holds, there is typically
no explicit expression for the evidence (the integral in (5.20)), and hence a computational
challenge is to make use of the posterior available only up to a normalizing constant. We now
elaborate on the details through variants of a very simple example in order to understand the main
concepts. For a general treatment of Bayesian inference we recommend [R07].


A Simple Poisson Example 


Consider an example where an insurance company models the number of weekly fires in a city
using a Poisson distribution with parameter $\lambda$. Here $\lambda$ is also the expected number of fires per week.
Assume that the following data is collected over a period of 16 weeks,

$$x = (x_1, \ldots, x_{16}) = (2, 1, 0, 0, 1, 0, 2, 2, 5, 2, 4, 0, 3, 2, 5, 0).$$

Each data point indicates the number of fires per week. In this case the MLE is $\hat{\lambda} = 1.8125$, simply
obtained by the sample mean. Hence in a frequentist approach, after 16 weeks the distribution of
the number of fires per week is modeled by a Poisson distribution with $\lambda = 1.8125$. One can then
obtain estimates for, say, the probability of having more than 5 fires in a given week as follows:

$$P(\text{fires per week} > 5) = 1 - \sum_{k=0}^{5} e^{-\hat{\lambda}} \frac{\hat{\lambda}^k}{k!} = 0.0107. \tag{5.23}$$
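The value in (5.23) is easy to reproduce directly from the data. The following is a minimal sketch using only base Julia and the Statistics standard library; the variable names lamHat and pMoreThan5 are illustrative.

```julia
using Statistics

data = [2, 1, 0, 0, 1, 0, 2, 2, 5, 2, 4, 0, 3, 2, 5, 0]
lamHat = mean(data)                 # MLE of the Poisson parameter (the sample mean)

# P(X > 5) = 1 - sum_{k=0}^{5} exp(-lam) * lam^k / k!
pMoreThan5 = 1 - sum(exp(-lamHat) * lamHat^k / factorial(k) for k in 0:5)

println("MLE: ", lamHat)
println("P(fires per week > 5) = ", round(pMoreThan5, digits = 4))
```

Any other estimate of $\lambda$, such as a Bayes estimate, can be substituted for lamHat to see how the tail probability changes.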


However, the drawback of such an approach in estimating $\lambda$ is that it didn't make use of previous
information. By comparison, in a Bayesian approach the estimate would allow one to incorporate
information from previous years, or alternatively from adjacent geographical areas. Say that for
example, further knowledge comes to light that the number of fires per week ranges between 0 and
10 and that the typical number is 2 fires per week. In this case one can assign a prior distribution
to $\lambda$ that captures this belief. Here is where some critics claim that such use of Bayesian statistics
turns into somewhat of a "voodoo science" since we have an infinite number of options to choose for
the prior. Still, it is often useful.


In response to such criticism, note that there is also an infinite number of choices for the model,
no matter what approach is used. In this sense, some prior information is always being used.


One could argue, for example, that the MLE implicitly treats 100000 fires per week as being just as
plausible beforehand as 1 fire per week. This is itself a prior assumption, and an unrealistic one,
which with small datasets may allow for estimates that are higher than is plausible.


Priors can also be viewed as a form of regularization. Bayesian approaches tend to be
most beneficial in small data regimes.


In our example, assume that we decide to use a triangular distribution as shown in blue in
Figure 5.15. Such a triangular distribution captures prior beliefs about the parameter $\lambda$ well,
because it has a defined range and a defined mode.


With the prior assigned and the data collected, we can use the machinery of Bayesian inference
of (5.20). In this specific case the prior distribution of the parameter $\lambda$ is the triangular distribution
with the PDF,

$$f(\lambda) = \begin{cases} \frac{1}{15}\lambda, & \lambda \in [0, 3], \\ \frac{1}{35}(10 - \lambda), & \lambda \in (3, 10]. \end{cases}$$

Figure 5.15: The prior distribution in blue and the posterior in red for
Bayesian estimation of a Poisson distribution.

With the 16 observations, $x_1, \ldots, x_{16}$, the likelihood is,

$$f(x \mid \lambda) = \prod_{k=1}^{16} \frac{e^{-\lambda} \lambda^{x_k}}{x_k!}.$$

Hence the posterior is proportional to $f(x \mid \lambda) f(\lambda)$. However, normalizing this function of
$\lambda$ requires dividing it by the evidence, given by,

$$\int_0^{10} f(x \mid \lambda) f(\lambda)\, d\lambda.$$
Typically this integral isn't easy to evaluate analytically, hence numerical methods are often used.
For illustration purposes, we carry out this numerical integration as part of Listing 5.18, where we
also plot the resulting posterior distribution (red curve in Figure 5.15). To appreciate potential
problems with such a numerical solution, imagine cases where the parameter $\theta$ is not just the scalar
$\lambda$ but rather consists of multiple dimensions. The integral of the evidence cannot be efficiently
computed in such cases.
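In the scalar case, the integration is a simple Riemann sum. The following sketch uses only base Julia: the triangular prior on [0, 10] with mode 3 is written out piecewise instead of being taken from a package, and the likelihood is kept only up to the factor $1/\prod_k x_k!$, which does not depend on $\lambda$ and therefore cancels when normalizing. The names likeUpToK and delta, and the step size, are illustrative choices.

```julia
data = [2, 1, 0, 0, 1, 0, 2, 2, 5, 2, 4, 0, 3, 2, 5, 0]
n, s = length(data), sum(data)

# Triangular prior on [0, 10] with mode 3, written out piecewise
prior(lam) = 0 <= lam <= 3 ? lam / 15 : (3 < lam <= 10 ? (10 - lam) / 35 : 0.0)

# Poisson likelihood up to a constant: exp(-n*lam) * lam^(sum of the data)
likeUpToK(lam) = exp(-n * lam) * lam^s

delta = 1e-3
lamRange = 0:delta:10

# Riemann sums for the evidence (up to the same constant) and the posterior mean
K = sum(likeUpToK(lam) * prior(lam) * delta for lam in lamRange)
bayesEstimate = sum(lam * likeUpToK(lam) * prior(lam) * delta for lam in lamRange) / K

println("Bayes estimate: ", bayesEstimate)
```

A grid like this needs 10^4 points for one parameter; the same resolution for a $d$-dimensional parameter would need 10^(4d) points, which is the difficulty referred to above.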


In Listing 5.18, once the posterior distribution is obtained, we compute its mean to obtain a Bayes
estimate for $\lambda$. The value obtained differs from the MLE obtained above, and hence probability
estimates using the model, such as (5.23), would also vary. Importantly, by employing the Bayesian
perspective, we were able to incorporate prior knowledge into the inference procedure.

Listing 5.18: Bayesian inference with a triangular prior

using Distributions, Plots, LaTeXStrings; pyplot()

prior(lam) = pdf(TriangularDist(0, 10, 3), lam)
data = [2,1,0,0,1,0,2,2,5,2,4,0,3,2,5,0]

like(lam) = *([pdf(Poisson(lam),x) for x in data]...)
posteriorUpToK(lam) = like(lam)*prior(lam)

delta = 10^-4.
lamRange = 0:delta:10
K = sum([posteriorUpToK(lam)*delta for lam in lamRange])
posterior(lam) = posteriorUpToK(lam)/K

bayesEstimate = sum([lam*posterior(lam)*delta for lam in lamRange])
println("Bayes estimate: ",bayesEstimate)

plot(lamRange, prior.(lamRange),
    c=:blue, label="Prior distribution")
plot!(lamRange, posterior.(lamRange),
    c=:red, label="Posterior distribution",
    xlims=(0, 10), ylims=(0, 1.2),
    xlabel=L"\lambda", ylabel="Density")











Bayes estimate: 1.9371887551439297 





In line 3 we define the prior. In line 4 we set the data values. In line 6 the likelihood function is
defined. Notice that the * operator is used as a function, and that the splat operator, ..., is applied
to the array of probabilities. Equation (5.21) is implemented in line 7, while lines 9-11 are used to numerically
compute the evidence. The actual posterior is defined in line 12. In line 14 a Bayes estimate from
the posterior is calculated, according to (5.22), and printed in line 15. The remainder of the code creates
Figure 5.15.





Conjugate Priors 


Following on from the previous example, a natural question arises: why use the specific form
of the prior distribution that we used? After all, the results would vary if we were to choose a
different prior. While in general Bayesian statistics doesn't supply a complete answer, there are
cases where certain families of prior distributions work very well with certain (other) families of
statistical models.


For example, in our case of a Poisson probability distribution model, it turns out that assuming
a gamma prior distribution works nicely. This is because the resulting posterior distribution is also
guaranteed to be gamma. In such a case, the gamma distribution is said to be a conjugate prior to the
Poisson distribution. The parameters of the prior/posterior distribution are called hyperparameters,
and when a conjugate prior relationship holds, the hyperparameters typically have a
simple update law from prior to posterior. This relieves a huge computational burden.


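To see this in the case of gamma-Poisson, assume the hyperparameters of the prior to be $\alpha$
(shape parameter) and $\beta$ (rate parameter). Now using the Poisson likelihood and the gamma PDF
we obtain:

$$
\begin{aligned}
\text{posterior} &\propto \Big(\prod_{k=1}^{n} \frac{e^{-\lambda}\lambda^{x_k}}{x_k!}\Big)\, \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta\lambda} \\
&\propto e^{-n\lambda}\, \lambda^{\sum_{k=1}^{n} x_k}\, \lambda^{\alpha-1} e^{-\beta\lambda} \\
&= \lambda^{\alpha + \sum_{k=1}^{n} x_k - 1}\, e^{-\lambda(\beta+n)} \\
&\propto \text{gamma density with shape parameter } \alpha + \textstyle\sum x_k \text{ and rate parameter } \beta + n.
\end{aligned}
\tag{5.24}
$$

This shows us the gamma-Poisson conjugacy and implies a slick update rule for the
hyperparameters: the hyperparameter $\alpha$ is updated to $\alpha + \sum x_k$ and the hyperparameter $\beta$ is updated to
$\beta + n$.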


In Listing 5.19 we use a gamma prior with prior parameters of $\alpha = 8$ and $\beta = 2$. For illustration,
we compute the posterior both using the brute force method of the previous listing and using
the simple hyperparameter update rule due to conjugacy. The posterior and prior are plotted in
Figure 5.16.
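As a worked instance of the update rule, take the 16 weekly observations of the fires example, whose sum is 29, together with a prior having $\alpha = 8$ and $\beta = 2$. The primed symbols below are notation introduced here for the updated hyperparameters, and the mean and variance use the standard formulas for a gamma distribution with shape $\alpha'$ and rate $\beta'$:

```latex
\begin{aligned}
\alpha' &= \alpha + \sum_{k=1}^{16} x_k = 8 + 29 = 37, \qquad \beta' = \beta + n = 2 + 16 = 18, \\
E[\lambda \mid x] &= \frac{\alpha'}{\beta'} = \frac{37}{18} \approx 2.0556, \qquad
\mathrm{Var}(\lambda \mid x) = \frac{\alpha'}{\beta'^2} = \frac{37}{324} \approx 0.1142, \\
\frac{\alpha'}{\beta'} &= \frac{\beta}{\beta + n}\cdot\frac{\alpha}{\beta} + \frac{n}{\beta + n}\cdot\bar{x}
= \frac{2}{18}\cdot 4 + \frac{16}{18}\cdot 1.8125 .
\end{aligned}
```

The last line shows that the posterior mean is a weighted average of the prior mean, 4, and the sample mean, 1.8125, with the data receiving more weight as $n$ grows.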


Listing 5.19: Bayesian inference with a gamma prior

using Distributions, Plots, LaTeXStrings; pyplot()

alpha, beta = 8, 2
prior(lam) = pdf(Gamma(alpha, 1/beta), lam)    # Gamma() takes shape and scale = 1/rate
data = [2,1,0,0,1,0,2,2,5,2,4,0,3,2,5,0]

like(lam) = *([pdf(Poisson(lam),x) for x in data]...)
posteriorUpToK(lam) = like(lam)*prior(lam)

delta = 10^-4.
lamRange = 0:delta:10
K = sum([posteriorUpToK(lam)*delta for lam in lamRange])
posterior(lam) = posteriorUpToK(lam)/K

bayesEstimate = sum([lam*posterior(lam)*delta for lam in lamRange])

newAlpha, newBeta = alpha + sum(data), beta + length(data)
closedFormBayesEstimate = mean(Gamma(newAlpha, 1/newBeta))

println("Computational Bayes Estimate: ", bayesEstimate)
println("Closed form Bayes Estimate: ", closedFormBayesEstimate)

plot(lamRange, prior.(lamRange),
    c=:blue, label="Prior distribution")
plot!(lamRange, posterior.(lamRange),
    c=:red, label="Posterior distribution",
    xlims=(0, 10), ylims=(0, 1.2),
    xlabel=L"\lambda", ylabel="Density")











Computational Bayes Estimate: 2.055555555555556 
Closed form Bayes Estimate: 2.0555555555555554

Figure 5.16: The prior and posterior for Bayesian estimation of a Poisson
distribution using gamma conjugacy.





In line 3 the prior hyperparameters are defined and in line 4 the prior distribution is defined. In
lines 7-13 the posterior is calculated in the same brute force manner as in Listing 5.18. Similarly, in
line 15 we compute the Bayes estimate in the same manner. Line 17 is where the simplicity of
conjugacy comes about: the hyperparameters are updated according to the conjugacy rule. Then in
line 18 closedFormBayesEstimate is computed just using the formula for the mean of a gamma
distribution (using mean() from Distributions). The Bayes estimates are printed in lines 20-21
and the remaining code lines create Figure 5.16.








Markov Chain Monte Carlo 


In many applications of Bayesian statistics, convenient situations of conjugate priors are
not available, yet computation of posterior distributions and Bayes estimates is needed. In cases
where the dimension of the parameter space is high, carrying out straightforward integration as done
in the previous listings is not possible. However, there are other ways of carrying out Bayesian inference.
One such popular way is by using algorithms that fall under the category known as Markov Chain
Monte Carlo, MCMC, also known as Monte Carlo Markov Chain (with a different word order).


The Metropolis-Hastings algorithm is one such popular MCMC algorithm. It produces a series
of samples $\theta(1), \theta(2), \theta(3), \ldots$, where it is guaranteed that for large $t$, $\theta(t)$ is distributed according
to the posterior distribution. Technically, the random sequence $\{\theta(t)\}_{t=1}^{\infty}$ is a Markov chain (see
Chapter 9 for more details about Markov chains) and it is guaranteed that the stationary distribution
of this Markov chain is the specified posterior distribution. That is, the posterior distribution is an
input parameter to the algorithm.


The major benefit of Metropolis-Hastings and similar MCMC algorithms is that they only use
ratios of the posterior on different parameter values. For example, for parameter values $\theta_1$ and $\theta_2$,
the algorithm only uses the posterior distribution via the ratio,

$$L(\theta_1, \theta_2) = \frac{f(\theta_1 \mid x)}{f(\theta_2 \mid x)}.$$

This means that the normalizing constant (evidence) is not needed, as it is implicitly cancelled out.
Thus using the posterior in the proportional form (5.21) suffices.


Further to the posterior distribution, an additional input parameter to Metropolis-Hastings is
the so-called proposal density, denoted by $q(\cdot \mid \cdot)$. This is a family of probability distributions where,
given a certain value of $\theta_1$ taken as a parameter, the new value, say $\theta_2$, is distributed with PDF,

$$q(\theta_2 \mid \theta_1).$$


The idea of Metropolis-Hastings is to walk around the parameter space by randomly generating
new values using $q(\cdot \mid \cdot)$. Then some new values are 'accepted' while others are not, all in a
manner which ensures the desired limiting behavior. The algorithm specification is to accept with
probability,

$$H = \min\Big\{1,\; L(\theta^*, \theta(t))\, \frac{q(\theta(t) \mid \theta^*)}{q(\theta^* \mid \theta(t))}\Big\},$$

where $\theta^*$ is the new proposed value, generated via $q(\cdot \mid \theta(t))$, and $\theta(t)$ is the current value. With
each such iteration, the new value is accepted with probability $H$ and otherwise rejected. With
certain technical requirements on the posterior and proposal densities, the theory of Markov chains
then guarantees that the stationary distribution of the sequence $\{\theta(t)\}$ is the posterior distribution.


Different variants of the Metropolis-Hastings algorithm employ different types of proposal
densities. There are also generalizations and extensions that we don't discuss here, such as Gibbs
Sampling and Hamiltonian Monte Carlo.


To help illustrate some of these concepts, we now implement a simple version of
Metropolis-Hastings where we use the folded normal distribution as a proposal density. This distribution is
achieved by taking a normal random variable $X$ with mean $\mu$ and variance $\sigma^2$ and considering
$Y = |X|$. In this case, the PDF of $Y$ is,

$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \Big( e^{-\frac{(y-\mu)^2}{2\sigma^2}} + e^{-\frac{(y+\mu)^2}{2\sigma^2}} \Big). \tag{5.25}$$








Our choice of this specific density is purely for simplicity of implementation, and in addition it suits 
the case that we demonstrate, where the support of the parameter in question is non-negative. 
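As a sanity check on (5.25), the sketch below integrates the folded normal PDF numerically and compares a probability computed from the PDF with one estimated by simulating |X|. The parameter values and the function name foldedPDF() are hypothetical choices made only for this illustration.

```julia
using Random
Random.seed!(0)

# Folded normal PDF as in (5.25); mu and sig are passed explicitly here
foldedPDF(y, mu, sig) =
    (exp(-(y-mu)^2/(2sig^2)) + exp(-(y+mu)^2/(2sig^2))) / (sig*sqrt(2pi))

mu, sig = 1.0, 0.5          # hypothetical parameter values

# Midpoint Riemann sum over [0, 10]; the mass beyond 10 is negligible here
dy = 0.001
grid = dy/2:dy:10
total = sum(foldedPDF(y, mu, sig)*dy for y in grid)

# Compare P(Y <= 1) from the PDF with the fraction of simulated |X| values below 1
probFromPDF = sum(foldedPDF(y, mu, sig)*dy for y in grid if y <= 1)
samples = abs.(mu .+ sig .* randn(10^6))
probFromSim = count(s -> s <= 1, samples) / length(samples)

println("Integral of PDF: ", total)
println("P(Y <= 1) via PDF: ", probFromPDF, ", via simulation: ", probFromSim)
```

The integral should be very close to 1 and the two probabilities should agree up to simulation error.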


In Listing 5.20 we implement Metropolis-Hastings for the same data and prior as in the previous
listing. In such an example one would not need to use MCMC, since conjugacy is
much more efficient; however, we do so here for purposes of comparison. Our results show that we
obtain essentially the same numerical results as we did using gamma conjugacy. The histogram of the
samples is plotted in Figure 5.17.

Listing 5.20: Bayesian inference using MCMC


using Distributions, Plots, LaTeXStrings; pyplot()

alpha, beta = 8, 2
prior(lam) = pdf(Gamma(alpha, 1/beta), lam)
data = [ ... ]    # entries illegible in this copy; same data as in the previous listing

like(lam) = *([pdf(Poisson(lam),x) for x in data]...)
posteriorUpToK(lam) = like(lam)*prior(lam)

sig = 0.5
foldedNormalPDF(x,mu) = (1/sqrt(2*pi*sig^2))*(exp(-(x-mu)^2/(2*sig^2))
                                + exp(-(x+mu)^2/(2*sig^2)))
foldedNormalRV(mu) = abs(rand(Normal(mu,sig)))

function sampler(piProb,qProp,rvProp)
    lam = 1
    warmN, N = 10^5, 10^6
    samples = zeros(N-warmN)

    for t in 1:N
        while true
            lamTry = rvProp(lam)
            L = piProb(lamTry)/piProb(lam)
            H = min(1,L*qProp(lam,lamTry)/qProp(lamTry,lam))
            if rand() < H
                lam = lamTry
                if t > warmN
                    samples[t-warmN] = lam
                end
                break
            end
        end
    end
    return samples
end

mcmcSamples = sampler(posteriorUpToK,foldedNormalPDF,foldedNormalRV)
println("MCMC Bayes Estimate: ",mean(mcmcSamples))

stephist(mcmcSamples, bins=100,
    c=:black, normed=true, label="Histogram of MCMC samples")

lamRange = 0:0.01:10
plot!(lamRange, prior.(lamRange),
    c=:blue, label="Prior distribution")

closedFormPosterior(lam)=pdf(Gamma(alpha + sum(data),1/(beta+length(data))),lam)
plot!(lamRange, closedFormPosterior.(lamRange),
    c=:red, label="Posterior distribution",
    xlims=(0, 10), ylims=(0, 1.2),
    xlabel=L"\lambda", ylabel="Density")











MCMC Bayes Estimate: 2.065756632471559

Figure 5.17: The prior and the posterior, together with a histogram of Markov chain Monte Carlo
samples generated using Metropolis-Hastings.





Lines 3-8 are similar to those of the previous listings, such as Listing 5.18. In lines 10-13 the proposal density
foldedNormalPDF() is defined in agreement with (5.25), along with a function for generating
a proposal random variable, foldedNormalRV(). Lines 15-35 define the function sampler(). It
operates on a desired (possibly un-normalized) density, piProb, and runs the Metropolis-Hastings algorithm
for sampling from that density. The argument qProp is the proposal density and the argument
rvProp is for generating from the proposal. All three arguments are assumed to be functions which
sampler() invokes. Our implementation uses a warm up sequence with a length specified by warmN
in line 17. The idea here is to let the algorithm run for a while to remove any bias introduced by
initial values. Lines 20-33 constitute the main loop over N samples generated by the algorithm. In
our implementation, we set up an internal loop (lines 21-32) that iterates until a proposal is accepted
(and breaks in line 30). The println() call that follows the invocation of sampler() prints the Bayes
estimate. As can be seen, it agrees closely with the estimate of Listing 5.19.





Chapter 6 


Confidence Intervals - DRAFT 


In this chapter we cover a variety of confidence intervals used in standard statistical procedures.
As introduced in Section 5.5, a confidence interval with confidence level $1 - \alpha$ is an interval $[L, U]$
resulting from the observations. When considering confidence intervals in the setting of symmetric
sampling distributions (as is the case for most of this chapter), a typical formula for $[L, U]$ is of the
form,

$$\hat{\theta} \pm K_\alpha \, s_{\text{err}}. \tag{6.1}$$

Here $\hat{\theta}$ is typically the point estimate for the parameter in question, $s_{\text{err}}$ is some measure or estimate
of the variability (e.g. the standard error), and $K_\alpha$ is a constant which depends on the model at hand
and on $\alpha$. Typically, as $\alpha \to 0$, $K_\alpha$ increases, implying a wider confidence
interval. For the examples in this chapter, common values for $K_\alpha$ are in the range $[1.5, 3.5]$ for
values of $\alpha$ in the range $[0.01, 0.1]$. Most of the confidence intervals presented in this chapter
follow the form of (6.1), with the specific form of $K_\alpha$ often depending on conditions such as:





Sample size: Is the sample size small or large?

Variance: Is the variance $\sigma^2$ known or unknown?

Distribution: Is the population assumed normally distributed or not?
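To make the dependence of $K_\alpha$ on $\alpha$ concrete, the following sketch evaluates the constant for the normal-based case, where $K_\alpha$ is the $1-\alpha/2$ quantile of the standard normal distribution. The three values of $\alpha$ are illustrative choices.

```julia
using Distributions

# Standard normal quantiles z_{1-alpha/2} for a few illustrative values of alpha
alphas = [0.1, 0.05, 0.01]
zVals = [quantile(Normal(), 1 - a/2) for a in alphas]

for (a, z) in zip(alphas, zVals)
    println("alpha = ", a, ":  K_alpha = ", round(z, digits=3))
end
```

The printed constants grow as $\alpha$ shrinks, which is the widening effect described above.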


In exploring confidence intervals with Julia, we compute the confidence intervals using standard
statistical formulas and then illustrate how they can be obtained using the HypothesisTests
package. This package includes various functions that generate objects resulting from specific
statistical procedures. We can either look at the output of these objects, or query them using other
functions, specifically the confint() function. This package is also used extensively in Chapter 7.





The individual sections of this chapter focus on specific confidence intervals and general concepts. 
In Section [6.1] we cover confidence intervals for the mean of a single population. In Section [6.2] we 
present comparisons of means of two populations. In Section [6.3] we cover confidence intervals for 
proportions. In Section [6.4] we gain a better understanding of model assumptions via the example 
of a confidence interval for the variance. In Section [6.5] we present the bootstrap method, a general 
methodology for creating confidence intervals. We then present prediction intervals, a
concept dealing with prediction of future observations based on previous ones. We close with
a section on credible intervals from Bayesian statistics.


6.1 Single Sample Confidence Intervals for the Mean


Let us first consider the case where we wish to estimate the population mean $\mu$ using a random
sample, $X_1, \ldots, X_n$. As covered previously, a point estimate for the mean is the sample mean $\bar{X}$.
A typical formula for the confidence interval of the mean is then,

$$\bar{X} \pm K_\alpha \frac{S}{\sqrt{n}}. \tag{6.2}$$

Here the bounds around the point estimator $\bar{X}$ are defined by the addition and subtraction of a
multiple, $K_\alpha$, of the standard error, $S/\sqrt{n}$, first introduced in Section 4.2. The multiple $K_\alpha$ takes
on different forms depending on the specific case at hand.


Population Variance Known 


If we assume that the population variance, $\sigma^2$, is known and the data is normally distributed,
then the sample mean $\bar{X}$ is normally distributed with mean $\mu$ and variance $\sigma^2/n$. This yields,

$$P\left(\mu - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha, \tag{6.3}$$

where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution. In Julia this is computed
via quantile(Normal(),1-alpha/2). If we denote the actual sample mean estimate obtained
from data by $\bar{x}$, then by rearranging the inequalities inside the probability statement above, we
obtain the following confidence interval formula,

$$\bar{x} \pm z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}. \tag{6.4}$$

In practice $\sigma^2$ is rarely known, hence it is tempting to replace $\sigma$ by $s$ (the sample standard deviation)
in the formula above. Such a replacement is generally fine for large samples. However, in the case
of small samples, one should use confidence intervals that assume the population variance is unknown, covered
at the end of this section.

The validity of the normality assumption should also be considered. In cases where the data
is not normally distributed, the probability statement (6.3) only approximately holds. However as
$n \to \infty$, it quickly becomes precise due to the central limit theorem. Hence the confidence interval
(6.4) may be used for non-small samples.
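The meaning of the confidence level can be checked by simulation. The sketch below is an illustration with hypothetical population parameters (it does not use the data file of the listings): it repeatedly draws normal samples, builds the interval (6.4) each time, and records how often the interval contains the true mean.

```julia
using Distributions, Random, Statistics
Random.seed!(0)

mu, sig, n = 5.0, 2.0, 20      # hypothetical population parameters and sample size
alpha = 0.1
z = quantile(Normal(), 1 - alpha/2)

# Returns true if the interval (6.4) from one simulated sample contains mu
function covers()
    xBar = mean(rand(Normal(mu, sig), n))
    halfWidth = z*sig/sqrt(n)
    return xBar - halfWidth <= mu <= xBar + halfWidth
end

N = 10^5
coverage = count(_ -> covers(), 1:N) / N
println("Estimated coverage: ", coverage, " (nominal: ", 1 - alpha, ")")
```

With normally distributed data and known variance, the estimated coverage should be close to the nominal level $1-\alpha$.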


In Julia, computation of confidence intervals is done using functions from the HypothesisTests
package (even when we don't carry out a hypothesis test). The code in Listing 6.1 illustrates
computation of the confidence interval using both the package and by evaluating the formula (6.4)
directly. It can be observed that both the direct computation and the use of the confint()
function yield the same result.

Listing 6.1: CI for single sample population, variance assumed known


using CSV, Distributions, HypothesisTests

data = CSV.read("../data/machine1.csv", header=false)[:,1]
xBar, n = mean(data), length(data)
sig = 1.2
alpha = 0.1
z = quantile(Normal(),1-alpha/2)

println("Calculating formula: ", (xBar - z*sig/sqrt(n), xBar + z*sig/sqrt(n)))
println("Using confint() function: ", confint(OneSampleZTest(xBar,sig,n),alpha))





Calculating formula: (52.51484557853184, 53.397566664027984) 
Using confint() function: (52.51484557853184, 53.397566664027984) 





Line 3 loads the data. Note the use of the header=false argument, and also the trailing [:,1]
which is used to select all rows of the first column of the data. In line 4 we calculate the sample
mean and the number of observations. In line 5, we stipulate the standard deviation as 1.2, as this
scenario is one in which the population standard deviation, or population variance, is assumed known.
In line 7 we calculate the value of $z$ for $1 - \alpha/2$. This quantity does not depend on the sample but is
a fixed number. As is well known from statistical tables, it equals approximately 1.65 when $\alpha = 10\%$.
In line 9 the formula for the confidence interval (6.4) is evaluated directly. In line 10 the function
OneSampleZTest() is first used to conduct a one sample z-test given the parameters xBar, sig, and n.
The confint() function is then applied to this output, for the specified value of alpha. It can be
observed that the two methods are in agreement. Note that hypothesis tests are covered further in Chapter 7.





Population Variance Unknown 


A celebrated procedure in elementary statistics is the confidence interval based on the
T-distribution. Here we relax the assumptions of the previous confidence interval by allowing $\sigma^2$
to be an unknown quantity. In this case, if we replace $\sigma$ by the sample standard deviation $s$,
then the probability statement (6.3) no longer holds. However, by using the T-distribution (see
Section 5.2) we are able to correct the confidence interval to,

$$\bar{x} \pm t_{1-\alpha/2,\,n-1}\frac{s}{\sqrt{n}}. \tag{6.5}$$

Here, $t_{1-\alpha/2,\,n-1}$ is the $1 - \alpha/2$ quantile of a T-distribution with $n - 1$ degrees of freedom. This can be
calculated in Julia via quantile(TDist(n-1),1-alpha/2).

For small samples, the replacement of $z_{1-\alpha/2}$ from (6.4) by $t_{1-\alpha/2,\,n-1}$ in (6.5) significantly affects
the width of the confidence interval, as for the same value of $\alpha$, the T case is wider. However, as
$n \to \infty$, we have $t_{1-\alpha/2,\,n-1} \to z_{1-\alpha/2}$ (as can also be seen graphically). Hence for non-small samples,
the confidence interval (6.5) is very close to the confidence interval (6.4) with $s$ replacing $\sigma$. Note
that the T-confidence interval hinges on the normality assumption of the data. In fact for small
samples, deviations from normality imply imprecision of the confidence intervals. However
for larger samples, these confidence intervals serve as a good approximation. Still, in these larger
sample cases, one might as well use $z_{1-\alpha/2}$ instead of $t_{1-\alpha/2,\,n-1}$.

The code in Listing 6.2 calculates the confidence interval (6.5) where it is assumed that the
population variance is unknown.

Listing 6.2: CI for single sample population with variance assumed unknown


using CSV, Distributions, HypothesisTests

data = CSV.read("../data/machine1.csv", header=false)[:,1]
xBar, n = mean(data), length(data)
s = std(data)
alpha = 0.1
t = quantile(TDist(n-1),1-alpha/2)

println("Calculating formula: ", (xBar - t*s/sqrt(n), xBar + t*s/sqrt(n)))
println("Using confint() function: ", confint(OneSampleTTest(xBar,s,n),alpha))





Calculating formula: (52.49989385779555, 53.412518384764276) 
Using confint() function: (52.49989385779555, 53.412518384764276)





This example is very similar to Listing 6.1; however, there are several differences. In line 5, since
the population variance is assumed unknown, the population standard deviation sig of Listing 6.1
is replaced with the sample standard deviation s. In line 7 the quantile t is calculated on a
T-distribution, TDist(n-1), with $n - 1$ degrees of freedom. Previously, the quantile z was calculated
on a standard normal distribution, Normal(). Lines 9 and 10 are very similar to those in the previous
listing, but z and sig are replaced with t and s respectively. It can be seen that the outputs of
lines 9 and 10 are in agreement, and that the confidence interval is wider than that calculated in
Listing 6.1.
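Part of the difference in width between the two intervals comes from the quantile: for the same $\alpha$, the T quantile exceeds the normal quantile, and the gap closes as the sample size grows. The following sketch tabulates this for a few hypothetical sample sizes.

```julia
using Distributions

alpha = 0.1
z = quantile(Normal(), 1 - alpha/2)

# T quantiles for increasing (hypothetical) sample sizes
sampleSizes = [5, 10, 30, 100, 1000]
tVals = [quantile(TDist(n-1), 1 - alpha/2) for n in sampleSizes]

for (n, t) in zip(sampleSizes, tVals)
    println("n = ", n, ":  t = ", round(t, digits=4), ",  t/z = ", round(t/z, digits=4))
end
println("z = ", round(z, digits=4))
```

The ratio t/z is the factor by which the T-based interval is wider than a z-based interval that uses the same standard deviation.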





6.2 Two Sample Confidence Intervals for the Difference in Means 


We now consider cases in which there are two populations involved. As an example, consider
two separate machines, 1 and 2, which are designed to make pipes of the same diameter. In this
case, due to small differences and tolerances in the manufacturing process, the distribution of pipe
diameters from each machine will differ. In such cases where two populations are involved, it is
often of interest to estimate the difference between the population means, $\mu_1 - \mu_2$.

In order to do this we first collect two random samples, $x_{1,1}, \ldots, x_{n_1,1}$ and $x_{1,2}, \ldots, x_{n_2,2}$. For
each sample $i = 1, 2$ we have the sample mean $\bar{x}_i$ and sample standard deviation $s_i$. In addition,
the difference in sample means, $\bar{x}_1 - \bar{x}_2$, serves as a point estimate for the difference in population
means, $\mu_1 - \mu_2$.

A confidence interval for $\mu_1 - \mu_2$ around the point estimate $\bar{x}_1 - \bar{x}_2$ is then constructed via the
same process seen previously,

$$\bar{x}_1 - \bar{x}_2 \pm K_\alpha \, s_{\text{err}}. \tag{6.6}$$

We now elaborate on the values of $K_\alpha$ and $s_{\text{err}}$ based on model assumptions.

In the (unrealistic) case that the population variances are known, we may explicitly compute,

$$\mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$

Hence the standard error is given by,

$$S_{\text{err}} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \tag{6.7}$$

When combined with the assumption that the data is normally distributed, we can derive the
following confidence interval,

$$\bar{x}_1 - \bar{x}_2 \pm z_{1-\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \tag{6.8}$$

While this case is not often applicable in practice, it is useful to cover for pedagogical reasons.
Due to the fact that the population variances are almost always unknown, the HypothesisTests
package in Julia does not have a function for this case. However, for completeness, we evaluate
equation (6.8) manually in Listing 6.3.

Listing 6.3: CI for difference in population means with variances known


using CSV, Distributions, HypothesisTests

data1 = CSV.read("../data/machine1.csv", header=false)[:,1]
data2 = CSV.read("../data/machine2.csv", header=false)[:,1]
xBar1, xBar2 = mean(data1), mean(data2)
n1, n2 = length(data1), length(data2)
sig1, sig2 = 1.2, 1.6
alpha = 0.05
z = quantile(Normal(),1-alpha/2)

println("Calculating formula: ", (xBar1 - xBar2 - z*sqrt(sig1^2/n1+sig2^2/n2),
                                  xBar1 - xBar2 + z*sqrt(sig1^2/n1+sig2^2/n2)))








Calculating formula: (1.1016568035908845, 2.9159620096069574) 





This listing is similar to those previously covered in this chapter. The sample means and number 
of observations are calculated in lines 5-6. In line 7, we stipulate the standard deviations of both 
populations 1 and 2, as 1.2 and 1.6 respectively (since in this scenario the population variances are 
assumed known). In lines 11-12 the confidence interval (6.8) is evaluated manually and printed as
output.





Population Variances Unknown and Assumed Equal 


Typically, when considering cases consisting of two populations, the population variances are
unknown. In such cases, a common and practical assumption is that the variances are equal, denoted
by $\sigma^2$. Based on this assumption, it is sensible to use both sample variances to estimate $\sigma^2$. This
estimated variance using both samples is known as the pooled sample variance, and is given by,

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}.$$

Upon closer inspection, it can be observed that the above is in fact a weighted average of the sample
variances of the individual samples.
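The weighted average interpretation can be verified directly. The sketch below uses two small hypothetical samples (made up for illustration) and computes the pooled variance both from the formula and as an explicit weighted average with weights $(n_1-1)/(n_1+n_2-2)$ and $(n_2-1)/(n_1+n_2-2)$.

```julia
using Statistics

# Two small hypothetical samples (illustration only)
x1 = [53.1, 52.4, 54.0, 51.8, 53.3, 52.9]
x2 = [50.2, 51.7, 49.5, 52.3]
n1, n2 = length(x1), length(x2)
s1sq, s2sq = var(x1), var(x2)

# Pooled variance written as in the formula above
spSq = ((n1-1)*s1sq + (n2-1)*s2sq) / (n1 + n2 - 2)

# The same quantity as a weighted average with weights summing to one
w1 = (n1-1) / (n1 + n2 - 2)
w2 = (n2-1) / (n1 + n2 - 2)
spSqWeighted = w1*s1sq + w2*s2sq

println("Pooled variance:  ", spSq)
println("Weighted average: ", spSqWeighted)
```

Because the weights are proportional to $n_i - 1$, the larger sample has more influence on the pooled estimate, and the result always lies between the two sample variances.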


In this case, it can be shown that,

$$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{S_{\text{err}}} \tag{6.9}$$

is distributed according to a T-distribution with $n_1 + n_2 - 2$ degrees of freedom, where the standard
error is,

$$S_{\text{err}} = S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}.$$

Hence we arrive at the following confidence interval,

$$\bar{x}_1 - \bar{x}_2 \pm t_{1-\alpha/2,\,n_1+n_2-2} \; s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \tag{6.10}$$

where $s_p$ is the square root of the observed pooled sample variance and $t_{1-\alpha/2,\,n_1+n_2-2}$ is a quantile of a
T-distribution with $n_1 + n_2 - 2$ degrees of freedom.


The code in Listing 6.4 calculates the confidence interval (6.10) where it is assumed that the
population variances are unknown and equal. This is compared with the result from the HypothesisTests
package using EqualVarianceTTest(). It can be observed that the results are in agreement.

Listing 6.4: CI for difference in means, variance unknown, assumed equal


using CSV, Distributions, HypothesisTests

data1 = CSV.read("../data/machine1.csv", header=false)[:,1]
data2 = CSV.read("../data/machine2.csv", header=false)[:,1]
xBar1, xBar2 = mean(data1), mean(data2)
n1, n2 = length(data1), length(data2)
alpha = 0.05
t = quantile(TDist(n1+n2-2),1-alpha/2)

s1, s2 = std(data1), std(data2)
sP = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1+n2-2))

println("Calculating formula: ", (xBar1 - xBar2 - t*sP*sqrt(1/n1 + 1/n2),
                                  xBar1 - xBar2 + t*sP*sqrt(1/n1 + 1/n2)))
println("Using confint() function: ", confint(EqualVarianceTTest(data1,data2),alpha))











Calculating formula: (1.1127539574575822, 2.90486485574026) 
Using confint() function: (1.1127539574575822, 2.90486485574026) 





In line 8, a T-distribution with n1+n2-2 degrees of freedom is used. In line 10 the sample standard
deviations are calculated. In line 11, the pooled sample standard deviation sP (the square root of the
pooled sample variance) is calculated. In lines 13-14, (6.10) is evaluated manually, while in line 15
the confint() function is used.


Population Variances Unknown and not Assumed Equal


In certain cases, it may be appropriate to relax the assumption of equal population variances.
In this case, the estimate for $S_{\text{err}}$ is given by,

$$S_{\text{err}} = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}.$$

This is due to the fact that the variance of the difference of two independent sample means is the
sum of the variances of each of the means. Hence in this case the statistic (6.9) is adapted to,

$$T = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}. \tag{6.11}$$

It turns out that (6.11) is only T-distributed if the variances are equal; otherwise it isn't.
Nevertheless, an approximate confidence interval is commonly used by approximating the distribution
of (6.11) with a T-distribution. This is called the Satterthwaite approximation.


The approximation suggests a T-distribution with a parameter (degrees of freedom) given via,

$$v = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}. \tag{6.12}$$

Now it holds that,

$$T \underset{\text{approx}}{\sim} t(v). \tag{6.13}$$

That is, the random variable $T$ from (6.11) is approximately distributed according to a T-distribution
with $v$ degrees of freedom. Note that $v$ does not need to be an integer. We investigate this
approximation further in Listing 6.6 later.
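To get a feel for (6.12), the sketch below evaluates the degrees of freedom for a few hypothetical combinations of sample standard deviations. The function name satterthwaiteDF() and all numerical values are illustrative assumptions. A known property of this formula is that $v$ always lies between $\min(n_1, n_2) - 1$ and $n_1 + n_2 - 2$.

```julia
# Satterthwaite degrees of freedom, following (6.12)
satterthwaiteDF(s1, s2, n1, n2) =
    (s1^2/n1 + s2^2/n2)^2 / ((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1))

n1, n2 = 20, 18                       # hypothetical sample sizes
for (s1, s2) in [(1.2, 1.6), (1.0, 1.0), (0.1, 5.0), (5.0, 0.1)]   # hypothetical values
    v = satterthwaiteDF(s1, s2, n1, n2)
    println("s1 = ", s1, ", s2 = ", s2, ":  v = ", round(v, digits=2))
end
```

When one sample variance dominates, $v$ moves towards the degrees of freedom of that sample alone; when the two terms $s_i^2/n_i$ are balanced, $v$ approaches the pooled value $n_1 + n_2 - 2$.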


Using the Satterthwaite approximation, following steps similar to previous confidence intervals,
and given (6.13), we arrive at the following confidence interval formula,

$$\bar{x}_1 - \bar{x}_2 \pm t_{1-\alpha/2,\,v}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}. \tag{6.14}$$

In Listing 6.5 we calculate the confidence interval (6.14) where it is assumed that the population
variances are unknown and not assumed equal. We then compare the result to that obtained from
the use of UnequalVarianceTTest() from the HypothesisTests package. It can be observed
that the results are in agreement.

Listing 6.5: CI for difference in means, variance unknown and unequal


using CSV, Distributions, HypothesisTests

data1 = CSV.read("../data/machine1.csv", header=false)[:,1]
data2 = CSV.read("../data/machine2.csv", header=false)[:,1]
xBar1, xBar2 = mean(data1), mean(data2)
s1, s2 = std(data1), std(data2)
n1, n2 = length(data1), length(data2)
alpha = 0.05

v = (s1^2/n1 + s2^2/n2)^2 / ( (s1^2/n1)^2 / (n1-1) + (s2^2/n2)^2 / (n2-1) )

t = quantile(TDist(v),1-alpha/2)

println("Calculating formula: ", (xBar1 - xBar2 - t*sqrt(s1^2/n1 + s2^2/n2),
                                  xBar1 - xBar2 + t*sqrt(s1^2/n1 + s2^2/n2)))
println("Using confint(): ", confint(UnequalVarianceTTest(data1,data2),alpha))





Calculating formula: (1.0960161148824918, 2.9216026983153505) 
Using confint(): (1.0960161148824918, 2.9216026983153505) 





The main difference in this code block from the previous code block is the calculation of the degrees
of freedom, v, which is performed in line 10. In line 12 v is then used to obtain the T-distribution
quantile t. In lines 14-15, equation (6.14) is evaluated manually, while in line 16 the confint()
function is used.





Exploring the Satterthwaite Approximation 


We now investigate the approximate distribution stated in (6.13). Observe that both sides of the
"distributed as" (~) symbol depend on the same random experiment.
Hence the statement can be presented generally, as a case of the following format,

$$X(\omega) \sim F_{h(\omega)}, \tag{6.15}$$

where $\omega$ is a point in the sample space (see Chapter 2). Here $X(\omega)$ is a random variable, and $F$
is a distribution that depends on a parameter $h$, which itself depends on $\omega$. In our case of the
Satterthwaite approximation, $h$ is given by (6.12). That is, $h$ can be thought of as $v$, which itself
depends on the specific observations made for our two sample groups (a function of $s_1$ and $s_2$).

Now by recalling the inverse probability transform covered earlier, we have that (6.15) is
equivalent to,

$$F_{h(\omega)}\big(X(\omega)\big) \sim \text{uniform}(0, 1). \tag{6.16}$$

Hence in the case of the Satterthwaite approximation, we expect (6.16) to hold approximately.
This distributional relationship would not hold with the naive alternative of treating $h$ as simply
dependent on the number of observations made ($n_1$ and $n_2$). Hence in this naive case, the distribution
is not expected to be uniform.
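The transform behind (6.16) can be illustrated in a simpler setting with a fixed, known distribution. The sketch below is an illustration only: it uses an exponential random variable (an assumption made for simplicity, not the T-distribution of this section), applies its CDF to simulated values, and checks that the result behaves like a uniform(0,1) random variable.

```julia
using Random, Statistics
Random.seed!(0)

# X is exponential with mean 1, so its CDF is F(x) = 1 - exp(-x)
F(x) = 1 - exp(-x)

N = 10^6
x = randexp(N)
u = F.(x)

# A uniform(0,1) variable has mean 1/2 and variance 1/12
println("Mean of F(X):     ", mean(u), "  (uniform: ", 1/2, ")")
println("Variance of F(X): ", var(u), "  (uniform: ", 1/12, ")")
println("Fraction below 0.25: ", mean(u .< 0.25))
```

If a wrong CDF were applied instead, the transformed values would no longer look uniform, which is the idea used to compare the Satterthwaite and the naive degrees of freedom.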


We investigate this in Listing 6.6 where we construct Figure 6.1, a Q-Q plot comparing T-values
calculated from the Satterthwaite approximation (6.13) and T-values calculated via the naive equal
variance case. See Section 4.4 for a description of Q-Q plots.

The results in Figure 6.1 indicate that the Satterthwaite approximation is a better approximation
than simply using the fixed degrees of freedom $n_1 + n_2 - 2$. It can be observed that the data from the fixed $v$
case deviates further from the 1:1 slope in comparison to the case where $v$ was calculated based on
each experiment's sample observations, i.e. calculated from equation (6.12). Hence the distribution
of T-statistics with Satterthwaite calculated $v$'s yields better results than the constant $v$ case.


Listing 6.6: Analyzing the Satterthwaite approximation

using Distributions, Statistics, Plots, Random; pyplot()
Random.seed!(0)

mu1, sig1, n1 = 0, 2, 8
mu2, sig2, n2 = 0, 30, 15
dist1 = Normal(mu1, sig1)
dist2 = Normal(mu2, sig2)
N = 10^6
tdArray = Array{Tuple{Float64,Float64}}(undef,N)

# Degrees of freedom as in (6.12)
df(s1,s2,n1,n2) =
    (s1^2/n1 + s2^2/n2)^2 / ( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )

for i in 1:N
    x1Data = rand(dist1,n1)
    x2Data = rand(dist2,n2)
    x1Bar,x2Bar = mean(x1Data),mean(x2Data)
    s1,s2 = std(x1Data),std(x2Data)
    tStat = (x1Bar - x2Bar) / sqrt(s1^2/n1 + s2^2/n2)
    tdArray[i] = (tStat , df(s1,s2,n1,n2))
end
sort!(tdArray, by = first)

invVal(v,i) = quantile(TDist(v),i/(N+1))

xCoords  = Array{Float64}(undef,N)
yCoords1 = Array{Float64}(undef,N)
yCoords2 = Array{Float64}(undef,N)

for i in 1:N
    xCoords[i] = first(tdArray[i])
    yCoords1[i] = invVal(last(tdArray[i]), i)
    yCoords2[i] = invVal(n1+n2-2, i)
end

scatter(xCoords, yCoords1, c=:blue, label="Calculated v", msw=0)
scatter!(xCoords, yCoords2, c=:red, label="Fixed v", msw=0)
plot!([-10,10], [-10,10], ratio=:equal, label="",
    c=:black, lw=0.3, xlims=(-8,8), ylims=(-8,8),
    xlabel="Theoretical t-distribution quantiles",
    ylabel="Simulated t-distribution quantiles", legend=:topleft)


Figure 6.1: Q-Q plots of T-statistics from identical experiments, given
Satterthwaite calculated v's, along with T-statistics given constant v.
(Horizontal axis: theoretical t-distribution quantiles. Vertical axis: simulated t-distribution quantiles.)





In lines 4-5 we set the means and standard deviations of the two underlying distributions, and the
number of observations that will be made for each group. In line 8 we set the number of times we
repeat the experiment, N. In line 9 we pre-allocate the array tdArray, in which each element is a
tuple. The first element of each tuple will be the T-statistic calculated via (6.11), while the second will
be the corresponding degrees of freedom calculated via (6.12). In lines 12-13 we define the function
df(), which implements (6.12). In lines 15-22, we conduct N experiments, where for each we calculate
the T-statistic and the degrees of freedom. In line 23 sort!() is used to re-order tdArray in
ascending order according to the T-statistics via by = first. This is done so that we can construct
the Q-Q plot. In line 25 the function invVal() is defined, which uses the quantile() function to
perform the inverse probability transform using the degrees of freedom associated with each T-statistic
for each experiment. Note that the quantile levels used are i/(N+1), so the unit interval is split into
N+1 parts. In lines 31-35 the quantiles of our data are calculated. Here xCoords represents the
T-statistic quantiles, and yCoords1 represents the quantiles of a T-distribution with $v$ degrees of
freedom, where $v$ is calculated via (6.12). The array yCoords2 on the other hand represents the
quantiles of a T-distribution with $v$ degrees of freedom, where $v = n_1 + n_2 - 2$. Lines 37-42 plot the
Q-Q plots, creating Figure 6.1.





6.3 Confidence Intervals for Proportions 


In certain inference settings the parameter of interest is a population proportion. Examples 
include the proportion of females within an animal population, the proportion of customers that 
own two or more cars, or the proportion of baby turtles that survive the first day after hatching. In all 
such cases, one may view the proportion as either a characteristic of the population or alternatively 
as the probability of some event happening when randomly sampling from the population.

When carrying out inference for a proportion we assume that there exists some unknown population
proportion $p \in (0,1)$. We then collect an i.i.d. sample of observations $I_1, \ldots, I_n$, where for
the $i$'th observation, $I_i = 0$ if the event in question does not happen, and $I_i = 1$ if the event occurs.
For example, dealing with the proportion of females, we set $I_i = 0$ if the $i$'th sampled individual is not a female
and $I_i = 1$ if it is a female.

A natural estimator for the proportion is then the sample mean of $I_1, \ldots, I_n$, which we denote,

$$\hat{p} = \frac{\sum_{i=1}^{n} I_i}{n}. \tag{6.17}$$

In this case, since the summands in the numerator are indicator variables, the sum is simply a count
of the number of observations for which the event occurred. Hence we also call $\hat{p}$ the proportion
estimator.

Now observe that each $I_i$ is a Bernoulli random variable with success probability $p$. Under the
i.i.d. assumption this means that the numerator of (6.17) is binomially distributed with parameters
$n$ and $p$ (see the earlier treatment of the binomial distribution). Hence,

$$E\Big[\sum_{i=1}^{n} I_i\Big] = np, \qquad \text{and} \qquad \mathrm{Var}\Big(\sum_{i=1}^{n} I_i\Big) = np(1-p).$$

By combining (6.17) with the above, we have that,

$$E[\hat{p}] = p, \qquad \text{and} \qquad \mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}. \tag{6.18}$$

Hence $\hat{p}$ is an unbiased and consistent estimator of $p$. That is, on average $\hat{p}$ estimates $p$ perfectly, and
as more observations are collected the variance of the estimator vanishes and $\hat{p} \to p$. Furthermore,
we can use the central limit theorem to create a normal approximation for the distribution of $\hat{p}$ and
yield an approximate confidence interval. To do so denote,



$$Z_n = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}.$$

This is a random variable that approximately follows a standard normal distribution. The
approximation improves as $n$ grows. The same also holds for a slightly different random variable,

$$\tilde{Z}_n = \frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}}. \tag{6.19}$$

This is because $\hat{p}$ is an unbiased and consistent estimator of $p$, and thus replacing the $p$'s in the
denominator of $Z_n$ with $\hat{p}$ to yield $\tilde{Z}_n$ does not significantly affect the distribution for large $n$. We
now use the approximate normality of $\tilde{Z}_n$ to create a confidence interval for $p$. First observe that
as a consequence of the approximate standard normal distribution,

$$P\big(z_{\alpha/2} \le \tilde{Z}_n \le z_{1-\alpha/2}\big) \approx 1 - \alpha. \tag{6.20}$$

We now use (6.19) in (6.20), along with the fact that $z_{\alpha/2} = -z_{1-\alpha/2}$, as follows,

$$\begin{aligned}
1 - \alpha &\approx P\big(-z_{1-\alpha/2} \le \tilde{Z}_n \le z_{1-\alpha/2}\big) \\
&= P\Big(-z_{1-\alpha/2} \le \frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}} \le z_{1-\alpha/2}\Big) \\
&= P\Big(-z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \le \hat{p} - p \le z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\Big) \\
&= P\Big(-\hat{p} - z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \le -p \le -\hat{p} + z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\Big) \\
&= P\Big(\hat{p} - z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \le p \le \hat{p} + z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\Big).
\end{aligned}$$

We thus arrive at the following (approximate) confidence interval for proportions formula,

$$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \tag{6.21}$$

Observe that this confidence interval formula agrees with the general form of (6.1), where the
standard error depends only on the statistic $\hat{p}$ and is represented as,

$$s_{\text{err}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}. \tag{6.22}$$
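The properties in (6.18) and the approximate coverage of (6.21) can be examined by simulation. The sketch below is an illustration with a hypothetical true proportion and sample size. It repeats the sampling experiment many times and compares the empirical mean and variance of $\hat{p}$ with (6.18), and the fraction of intervals containing $p$ with the nominal level.

```julia
using Distributions, Random, Statistics
Random.seed!(0)

p, n = 0.3, 200                 # hypothetical true proportion and sample size
alpha = 0.05
z = quantile(Normal(), 1 - alpha/2)

# Simulate one sample of indicators and report (pHat, whether (6.21) covers p)
function oneExperiment()
    pHat = mean(rand(n) .< p)
    sErr = sqrt(pHat*(1 - pHat)/n)
    return pHat, (pHat - z*sErr <= p <= pHat + z*sErr)
end

N = 10^5
results = [oneExperiment() for _ in 1:N]
pHats = first.(results)
coverage = mean(last.(results))

println("Mean of pHat:     ", mean(pHats), "  (p = ", p, ")")
println("Variance of pHat: ", var(pHats), "  (p(1-p)/n = ", p*(1 - p)/n, ")")
println("Estimated coverage: ", coverage, " (nominal: ", 1 - alpha, ")")
```

Since the interval relies on a normal approximation, the estimated coverage is only approximately equal to the nominal level, and the match generally degrades for small $n$ or for $p$ near 0 or 1.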


Similar, more complex confidence interval formulas also exist for the case of two populations. Say
one is interested in comparing the proportion of females in two different sub-species populations
of crocodiles. By sampling $n_1$ crocodiles from one sub-species and $n_2$ crocodiles from the other
sub-species, one can form the point estimators $\hat{p}_1$ and $\hat{p}_2$, each in the same manner as (6.17). The
point estimator for the difference in proportions is then simply $\hat{p}_1 - \hat{p}_2$. An approximate $1 - \alpha$
confidence interval for this parameter is,

$$\hat{p}_1 - \hat{p}_2 \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}. \tag{6.23}$$

Compare this formula with the general form (6.6) and other formulas for two populations presented
in the previous section. You can see that (6.23) follows a similar structure. We now return to
examples and discussions of (6.21) and don't discuss (6.23) further.
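For completeness, a direct evaluation of (6.23) takes only a few lines. The sketch below uses hypothetical counts, made up purely for illustration.

```julia
using Distributions

# Hypothetical counts: females observed in samples from two sub-species
n1, females1 = 120, 66
n2, females2 = 150, 63

pHat1, pHat2 = females1/n1, females2/n2
alpha = 0.05
z = quantile(Normal(), 1 - alpha/2)

# Standard error and interval following (6.23)
sErr = sqrt(pHat1*(1 - pHat1)/n1 + pHat2*(1 - pHat2)/n2)
diffHat = pHat1 - pHat2
confInt = (diffHat - z*sErr, diffHat + z*sErr)

println("Point estimate of p1 - p2: ", diffHat)
println("Approximate ", 100*(1 - alpha), "% confidence interval: ", confInt)
```

The structure mirrors the two sample formulas for means: a point estimate of the difference, plus and minus a quantile times a standard error that adds the two variance estimates.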


In Listing 6.7 we demonstrate basic usage of the confidence interval for proportions formula
(6.21). We consider the Grade column of purchaseData.csv. As the code demonstrates, here
the possible grades are 'A' to 'E'. We obtain a point estimate and a confidence interval for the
proportion of observations with level 'E'. You may modify line 11 of the code to carry out inference
for other levels. Note that this code also deals with missing observations by culling the missing
observations, using only observations for which Grade is not missing.

Listing 6.7: Confidence interval for a proportion


using CSV, DataFrames, CategoricalArrays, Distributions

data = CSV.read("../data/purchaseData.csv", copycols = true)
println("Levels of Grade: ", levels(data.Grade))
println("Data points: ", nrow(data))

n = sum(.!(ismissing.(data.Grade)))
println("Non-missing data points: ", n)
data2 = dropmissing(data[:,[:Grade]],:Grade)

gradeInQuestion = "E"
indicatorVector = data2.Grade .== gradeInQuestion
numSuccess = sum(indicatorVector)
phat = numSuccess/n
serr = sqrt(phat*(1-phat)/n)

alpha = 0.05
confidencePercent = 100*(1-alpha)
zVal = quantile(Normal(),1-alpha/2)
confInt = (phat - zVal*serr, phat + zVal*serr)

println("\nOut of $n non-missing observations, "*
        "$numSuccess are at level $gradeInQuestion.")
println("Hence a point estimate for the proportion "*
        "of grades at level $gradeInQuestion is $phat.")
println("A $confidencePercent% confidence interval for "*
        "the proportion of level $gradeInQuestion is:\n$confInt.")








Levels of Grade: ["A", "B", "C", "D", "E"]
Data points: 200
Non-missing data points: 187

Out of 187 non-missing observations, 61 are at level E.
Hence a point estimate for the proportion of grades at level E is 0.3262.
A 95.0% confidence interval for the proportion of level E is:
(0.2590083767381328, 0.3933980403741667).














Lines 3-5 load and describe the data, focusing on the Grade column. Note the use of levels() from CategoricalArrays. Lines 7-9 handle missing values. Note the use of '.!()' to broadcast negation on the output of a broadcasted ismissing(). Summing this yields the number of (non-missing) observations n. The new data frame, data2, consists of a single variable Grade after dropmissing() is applied with :Grade as a second argument. In line 11 we choose to carry out proportion estimation for "E". Line 12 creates the indicators $I_1,\ldots,I_n$. Line 14 calculates $\hat{p}$ and line 15 calculates $s_{\text{err}}$ as in (6.22). Lines 17-20 determine the confidence interval (6.21), with confInt represented as a Tuple. The remainder of the code prints the output describing the results.
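
As a quick check of the arithmetic, the interval reported above can be reproduced from the two counts alone. The following is a minimal sketch rather than part of the original listing: the function name proportionCI is our own, and the value 1.959964 is hard-coded as the 0.975 quantile of the standard normal distribution so that no packages are needed.

```julia
# Proportion confidence interval (6.21) computed from counts alone.
# z defaults to the 0.975 standard normal quantile (alpha = 0.05), hard-coded here.
function proportionCI(numSuccess, n; z = 1.959964)
    phat = numSuccess/n
    serr = sqrt(phat*(1-phat)/n)      # standard error as in (6.22)
    (phat - z*serr, phat + z*serr)
end

ci = proportionCI(61, 187)            # 61 grades at level E out of 187
println("95% confidence interval: ", ci)
```

With 61 successes out of 187 observations this should agree with the interval printed by Listing 6.7 to about four decimal places.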

Sample Size Planning





Denote the confidence interval (6.21) as $\hat{p} \pm E$, where E is the margin of error, or half the width of the confidence interval, given by,

$$E = z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

You may often want to plan an experiment, or a sampling scheme, such that E is not too large, for example 'not more than 0.1'. For this, you need to choose a sample size n prior to sampling. We now illustrate a crude yet effective way for such sample size planning.


First observe that for typical values of $\alpha$ we have that $z_{1-\alpha/2} \approx 2$. In fact, when $\alpha = 0.0455$ we have that $z_{1-\alpha/2} = 2$ almost exactly. Values ranging between 1.5 and 2.5 are common for most confidence levels chosen in practice. Hence in general, crudely taking $z_{1-\alpha/2}$ as '2' helps simplify expressions.


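Say we want $E \le \epsilon$, e.g. with the maximal margin of error $\epsilon = 0.1$, or any other similar value. Then taking $z_{1-\alpha/2} = 2$ for simplicity we get,

$$2\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le \epsilon, \qquad \text{or} \qquad \frac{4\,\hat{p}(1-\hat{p})}{\epsilon^2} \le n. \qquad (6.24)$$

Now also observe that $x(1-x)$ is maximized at $x = 1/2$ with a maximal value of 1/4. Hence,

$$\frac{4\,\hat{p}(1-\hat{p})}{\epsilon^2} \le \frac{1}{\epsilon^2}.$$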


This means that by taking $n \ge 1/\epsilon^2$ we ensure (6.24). To obtain a whole number of observations we use the $\lceil \cdot \rceil$ 'ceiling' (rounding up) operator, to get the proportions sample size formula,

$$n^* = \left\lceil \frac{1}{\epsilon^2} \right\rceil. \qquad (6.25)$$


In Listing 6.8 we create a simple table implementing (6.25) to get a sense for the magnitude of samples needed. As you can see, for $\epsilon = 0.1$ we need 100 observations. However, if we seek a more accurate confidence interval with $\epsilon = 0.01$ then 10,000 observations are needed! Again, keep in mind that these calculations assume $\alpha = 0.0455$.


Listing 6.8: Sample size planning for proportions

for eps in [0.1, 0.05, 0.02, 0.01]
    n = ceil(1/eps^2)
    println("For epsilon = $eps set n = $n")
end








For epsilon = 0.1 set n = 100.0 
For epsilon = 0.05 set n = 400.0
For epsilon = 0.02 set n = 2500.0 
For epsilon = 0.01 set n = 10000.0 


The listing is a straightforward implementation of (6.25). Observe the use of ceil() in line 2.
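
The same derivation goes through for any quantile $z_{1-\alpha/2}$ in place of '2': bounding $\hat{p}(1-\hat{p})$ by 1/4 in the margin of error gives $n \ge z_{1-\alpha/2}^2/(4\epsilon^2)$. The sketch below illustrates this under our own naming (sampleSize is a hypothetical helper, not part of the listing). With z = 2 it reproduces (6.25), while the hard-coded z = 1.959964 corresponds to $\alpha = 0.05$.

```julia
# Smallest n guaranteeing a margin of error of at most eps for quantile z.
# z = 2 reproduces (6.25); other values of z are a generalization of the text.
sampleSize(eps; z = 2) = ceil(Int, z^2/(4*eps^2))

for eps in [0.1, 0.05, 0.02, 0.01]
    println("eps = $eps: n = ", sampleSize(eps), " (z = 2), n = ",
            sampleSize(eps, z = 1.959964), " (z = 1.96)")
end
```

Using the slightly smaller quantile for a 95% level lowers the required sample size a little, but the order of magnitude is unchanged.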

Figure 6.2: Approximation error of a confidence interval for proportion. The heatmaps are for various n and p combinations for $\alpha = 0.05$.


Validity of the Approximation 


The key to the derivation of (6.21) is the distributional approximation in (6.20). In many cases this approximation works well. However, for small sample sizes n or values of p near 0 or 1, it is often too crude an approximation. A consequence is that one may obtain a confidence interval that isn't actually a $1-\alpha$ confidence interval, but rather has a different coverage probability.


One common rule of thumb used to decide if (6.21) is valid is to require that both the product $np$ and the product $n(1-p)$ be at least 10. For p = 0.5 this rule specifies a minimal sample size of n = 20, and for other values of p higher values of n are required.


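How does such a rule come about? To explore this we now present a computational experiment, aiming to assess when (6.21) is valid. For this we explore a grid of n ranging from 5 to 50 and p in the interval [0.1, 0.9]. For each combination we repeat N = 5,000 Monte Carlo experiments and calculate the following:

$$(1-\alpha) - \frac{1}{N}\sum_{k=1}^{N} \mathbf{1}\left\{ p \in \left[\hat{p}_k - z_{1-\alpha/2}\sqrt{\frac{\hat{p}_k(1-\hat{p}_k)}{n}},\ \hat{p}_k + z_{1-\alpha/2}\sqrt{\frac{\hat{p}_k(1-\hat{p}_k)}{n}}\right]\right\}, \qquad (6.26)$$

where $\hat{p}_k$ denotes the proportion estimate obtained in the k-th experiment.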


This estimated difference between the actual coverage probability of the confidence interval and the desired confidence level $1-\alpha$ is a measure of the accuracy of the confidence level. We expect this difference to be almost 0 if the approximation is 'good'. Otherwise, a higher absolute difference is observed.


Listing 6.9 creates Figure 6.2, which presents the results of this simulation experiment for $\alpha = 0.05$. The left plot illustrates the estimated difference between the actual coverage probability and $1-\alpha$. Observe that in general for p values around 0.5 there is less error than for p values closer to 0 or 1. Also, as expected, when n is increased the error drops. There is also a periodic effect due to the fact that for small n, the proportion estimator $\hat{p}$ only falls on a small finite set of values.

The right hand plot of Figure 6.2 compares the absolute value of (6.26) to 0.04 (an ad-hoc tolerance that we selected). For (n, p) where the absolute error is less than 0.04 we can say the confidence error is 'small' and conclude that using the confidence interval formula is satisfactory. As seen from the plot, this occurs when p is closer to 0.5 and n is at around 20 or more. This may give some insight into the heuristic rule described above and generally agrees with it.
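
For a single (n, p) pair the coverage probability can also be computed exactly, since the number of successes is binomial and the interval (6.21) depends on the data only through that count. The following sketch is our own complement to the Monte Carlo experiment, not part of the listing: the name exactCoverage is hypothetical and the quantile 1.959964 for $\alpha = 0.05$ is hard-coded.

```julia
# Exact coverage probability of the interval (6.21) for given n and p:
# sum the binomial probabilities of all counts k whose interval contains p.
function exactCoverage(n, p; z = 1.959964)
    total = 0.0
    for k in 0:n
        pHat = k/n
        serr = sqrt(pHat*(1-pHat)/n)
        if pHat - z*serr <= p <= pHat + z*serr
            total += binomial(n,k) * p^k * (1-p)^(n-k)
        end
    end
    total
end

for (n, p) in [(5, 0.1), (20, 0.5), (50, 0.5), (50, 0.1)]
    println("n = $n, p = $p: coverage = ", exactCoverage(n, p))
end
```

Because the count k takes only n + 1 values, the coverage jumps as n or p changes, which is one way to see the periodic effect mentioned above.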


Listing 6.9: Coverage accuracy of a confidence interval for proportions

using Random, Plots, Distributions, Measures; pyplot()

N = 5*10^3
alpha = 0.05
confLevel = 1 - alpha
z = quantile(Normal(),1-alpha/2)

function randCI(n,p)
    sample = rand(n) .< p
    pHat = sum(sample)/n
    serr = sqrt(pHat*(1-pHat)/n)
    (pHat - z*serr, pHat + z*serr)
end
cover(p,ci) = ci[1] <= p && p <= ci[2]

pGrid = 0.1:0.01:0.9
nGrid = 5:1:50
errs = zeros(length(nGrid),length(pGrid))

for i in 1:length(nGrid)
    for j in 1:length(pGrid)
        Random.seed!(0)
        n, p = nGrid[i], pGrid[j]
        coverageRatio = sum([cover(p,randCI(n,p)) for _ in 1:N])/N
        errs[i,j] = confLevel - coverageRatio
    end
end

default(xlabel = "p", ylabel = "n",
    xticks =([1:5:length(pGrid);], minimum(pGrid):0.05:maximum(pGrid)),
    yticks =([1:5:length(nGrid);], minimum(nGrid):5:maximum(nGrid)))

p1 = heatmap(errs, c=cgrad([:white, :black]))
p2 = heatmap(abs.(errs) .<= 0.04, legend = false, c=cgrad([:black, :white]))
plot(p1,p2, size = (1000,400), margin = 5mm)








In line 3 we set the number of Monte Carlo repetitions, N. Lines 4-6 define constants for the confidence interval based on alpha from line 4. In lines 8-13 we define the function randCI() for generating a random sample and an associated confidence interval. The function cover() that we define in line 14 checks if p is covered by the given confidence interval, ci. Lines 16-17 define the grid of p values and n values on which we estimate (6.26). In line 18 we initialize the matrix errs using zeros(). The simulation repetitions are in lines 20-27 where, after resetting the seed in line 22, for each (n, p) combination we compute the sum in (6.26) into coverageRatio in line 24 by composing cover() on randCI(). Then in line 25 we record the estimated difference in the matrix errs. Lines 29-35 create Figure 6.2.

6.4 Confidence Interval for the Variance of Normal Population


We now consider confidence intervals when the parameter in question is the variance. We also use this as an example to show how model assumptions may strongly affect the accuracy of the confidence interval. Consider sampling from a population that follows a normal distribution. A point estimator for the population variance is the sample variance,

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$





As illustrated in Section 5.2, when multiplied by the constant $(n-1)/\sigma^2$, the sample variance follows a chi-squared distribution with $n-1$ degrees of freedom,

$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

Therefore, denoting the $\gamma$-quantile of this distribution via $\chi^2_{\gamma,n-1}$ we have,

$$P\left(\chi^2_{\alpha/2,\,n-1} < \frac{(n-1)S^2}{\sigma^2} < \chi^2_{1-\alpha/2,\,n-1}\right) = 1-\alpha. \qquad (6.27)$$


Hence we can re-arrange to obtain a two-sided $100(1-\alpha)\%$ confidence interval for the variance of a normal population, where we denote the observed estimator by $s^2$:

$$\left(\frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}},\ \frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}}\right). \qquad (6.28)$$


Note that (6.27) only holds when sampling from data that is normally distributed. If the data is not normally distributed, then our confidence intervals will be inaccurate. Such sensitivity to assumptions is explored later in this section. However, first we demonstrate a simple example for using the confidence interval (6.28) in Listing 6.10.


Listing 6.10: Confidence interval for the variance

using CSV, Distributions, HypothesisTests

data = CSV.read("../data/machine1.csv", header=false)[:,1]
n, s, alpha = length(data), std(data), 0.1
ci = ((n-1)*s^2/quantile(Chisq(n-1),1-alpha/2),
        (n-1)*s^2/quantile(Chisq(n-1),alpha/2))

println("Point estimate for the variance: ", s^2)
println("Confidence interval for the variance: ", ci)





Point estimate for the variance: 1.3928282706110504 
Confidence interval for the variance: (0.8779243703322502, 2.6157658366723124) 


The code is similar to Listing 6.2 and uses the same dataset. Lines 5-6 implement (6.28) based on the sample standard deviation s and a chi-squared distribution, Chisq().
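
The distributional fact behind (6.27) can be checked by simulation without any packages beyond the standard library. The sketch below is our own illustration with assumed values (n = 15, $\mu = 2$, $\sigma = 3$): it generates many scaled sample variances and compares their mean and variance with those of a chi-squared distribution with n − 1 degrees of freedom, namely n − 1 and 2(n − 1). It also estimates the probability in (6.27) for $\alpha = 0.1$, using the tabulated quantiles 6.5706 and 23.6848 for 14 degrees of freedom, which are hard-coded.

```julia
using Random, Statistics
Random.seed!(0)

mu, sig, n, N = 2, 3, 15, 10^5     # illustrative values, not taken from the listing

# N replications of the scaled sample variance (n-1)S^2/sigma^2
scaled = [(n-1)*var(mu .+ sig*randn(n))/sig^2 for _ in 1:N]

println("Mean: ", mean(scaled), " (chi-squared mean is ", n-1, ")")
println("Variance: ", var(scaled), " (chi-squared variance is ", 2*(n-1), ")")

# tabulated 0.05 and 0.95 quantiles of the chi-squared distribution with 14 d.f.
inside = mean(6.5706 .< scaled .< 23.6848)
println("Estimated probability in (6.27) for alpha = 0.1: ", inside)
```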

Figure 6.3: PDFs of the normal and logistic distributions, along with histograms of the sample variances from the corresponding distributions.


Sensitivity of the Normality Assumption 


We now look at the sensitivity of the confidence interval for the variance to the normality assumption. As part of this example we first introduce the logistic distribution, which has a "bell curved" shape somewhat similar to the normal distribution. It is defined by the location and scale parameters, $\mu$ and $\eta$. The PDF of the logistic distribution is,

$$f(x) = \frac{e^{-\frac{x-\mu}{\eta}}}{\eta\left(1 + e^{-\frac{x-\mu}{\eta}}\right)^2}, \qquad (6.29)$$

with the variance given by $\eta^2\pi^2/3$.


In Listing 6.11 we create Figure 6.3 where in the left plot, the PDF of a normal distribution with mean $\mu = 2$ and standard deviation $\sigma = 3$ is plotted against that of a logistic distribution with the same mean and variance. To achieve that we require $\eta^2\pi^2/3 = \sigma^2$ and hence,

$$\eta = \frac{\sqrt{3}}{\pi}\,\sigma. \qquad (6.30)$$


While both of these symmetric (about the mean) distributions share the same mean and variance and hence are somewhat similar, in the right plot we show via Monte Carlo that the distributions of their sample variances with n = 15 are actually significantly different. This gives a first hint at the fact that the confidence interval formula (6.28) may be very sensitive to the normality assumption. Later, in the example that follows, we investigate the effect of this on the confidence interval.

Listing 6.11: Comparison of sample variance distributions

using Distributions, Plots; pyplot()

mu, sig = 2, 3
eta = sqrt(3)*sig/pi
n, N = 15, 10^7
dNormal = Normal(mu, sig)
dLogistic = Logistic(mu, eta)
xGrid = -8:0.1:12

sNormal = [var(rand(dNormal,n)) for _ in 1:N]
sLogistic = [var(rand(dLogistic,n)) for _ in 1:N]

p1 = plot(xGrid, pdf.(dNormal,xGrid), c=:blue, label="Normal")
p1 = plot!(xGrid, pdf.(dLogistic,xGrid), c=:red, label="Logistic",
    xlabel="x",ylabel="Density", xlims=(-8,12), ylims=(0,0.16))

p2 = stephist(sNormal, bins=200, c=:blue, normed=true, label="Normal")
p2 = stephist!(sLogistic, bins=200, c=:red, normed=true, label="Logistic",
    xlabel="Sample Variance", ylabel="Density", xlims=(0,30), ylims=(0,0.14))

plot(p1, p2, size=(800, 400))





In line 3 we define the mean and standard deviation that will be used for both distributions. Then in line 4 we calculate eta according to (6.30). In line 5 the number of sample observations, n, and total number of experiments, N, are specified. In lines 6-7 we define the two distributions with matched mean and variance. Note that the Julia Logistic() function uses the same parametrization as that of equation (6.29). In lines 10-11 comprehensions are used to generate N sample variances from the normal and logistic distributions dNormal and dLogistic, with the values assigned to the arrays sNormal and sLogistic respectively. The remainder of the code creates the plots with lines 13-15 creating the left plot by using pdf() broadcasted over xGrid and lines 17-19 creating histograms of the sample variances.
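
The moment matching in (6.30) can be verified numerically without the Distributions package. The sketch below is our own illustration: it samples from the logistic distribution by the inverse probability transform, using the fact that the logistic quantile function is $\mu + \eta\log(u/(1-u))$, and compares the sample mean and variance with $\mu = 2$ and $\sigma^2 = 9$. The helper name randLogistic is hypothetical.

```julia
using Random, Statistics
Random.seed!(0)

mu, sig = 2, 3
eta = sqrt(3)*sig/pi                       # scale parameter as in (6.30)
println("Logistic variance eta^2*pi^2/3 = ", eta^2*pi^2/3)

# inverse-CDF sampling from the logistic distribution with parameters mu, eta
randLogistic(mu, eta) = (u = rand(); mu + eta*log(u/(1-u)))

x = [randLogistic(mu, eta) for _ in 1:10^5]
println("Sample mean: ", mean(x), ", sample variance: ", var(x))
```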











Having seen that the distribution of the sample variance heavily depends on the shape of the actual distribution, we now investigate the effect that this has on the accuracy of the confidence interval. Specifically, we show that while usage of the confidence interval formula (6.28) yields $1-\alpha$ coverage for normally distributed data, it strongly deviates for the logistic distribution case.


In Listing 6.12 we cycle through different values of $\alpha$ from 0.001 to 0.1, and for each value, we perform N of the following identical experiments: calculate the sample variance of n observations and evaluate the confidence interval (6.28). We then calculate the proportion of times that the actual (unknown) variance of the distribution is contained within the confidence interval, in a similar manner to what we did in the context of proportions in (6.26). The effective $\alpha$ values are then plotted against the $\alpha$ values used in Figure 6.4.


It can be observed that in the case of the normal distribution, the simulated $\alpha$ values align with those of the actual $\alpha$ used. However, in the case of the logistic distribution there is a strong discrepancy. This illustrates that model assumptions are critical for the correctness of the confidence interval (6.28). Note that in general, confidence intervals for the mean, such as (6.5), would be less sensitive to model assumptions than the confidence interval for the variance.

Figure 6.4: Actual $\alpha$ values vs. $\alpha$ values used in confidence intervals.


using Distributions, Plots, LaTeXStrings; pyplot()

mu, sig = 2, 3
eta = sqrt(3)*sig/pi
n, N = 15, 10^4
dNormal = Normal(mu, sig)
dLogistic = Logistic(mu, eta)
alphaUsed = 0.001:0.001:0.1

function alphaSimulator(dist, n, alpha)
    popVar = var(dist)
    coverageCount = 0
    for _ in 1:N
        sVar = var(rand(dist, n))
        L = (n - 1) * sVar / quantile(Chisq(n-1),1-alpha/2)
        U = (n - 1) * sVar / quantile(Chisq(n-1),alpha/2)
        coverageCount += L < popVar && popVar < U
    end
    1 - coverageCount/N
end

scatter(alphaUsed, alphaSimulator.(dNormal,n,alphaUsed),
    c=:blue, msw=0, label="Normal")
scatter!(alphaUsed, alphaSimulator.(dLogistic, n, alphaUsed),
    c=:red, msw=0, label="Logistic")
plot!([0,0.1],[0,0.1],c=:black, label="1:1 slope",
    xlabel=L"\alpha"*" used", ylabel=L"\alpha"*" actual",
    legend=:topleft, xlim=(0,0.1), ylims=(0,0.2))











Lines 3-7 are as in the previous listing, apart from the value of N. In line 8 we define a grid of $\alpha$ values over which we carry out the experiment. Lines 10-20 define the function alphaSimulator(). This function takes a distribution, the total number of sample observations and a value of alpha as input. It then generates N separate confidence intervals for the variance via equation (6.28) and returns the corresponding proportion of times the confidence intervals do not contain the actual variance of the distribution. We apply alphaSimulator() directly in lines 22 and 24 as part of the scatter plots. Lines 26-28 plot the 1:1 line on which $\alpha$ used equals $\alpha$ actual.

Figure 6.5: A single confidence interval for the mean generated by bootstrapped data.


6.5 Bootstrap Confidence Intervals 


When developing confidence intervals, the main goal is to make some sort of inference about the population based on sample data. However, in some cases a statistical model may not be readily available, or as we saw in Listing 6.9 and Listing 6.12, model error may cause inaccuracies. Hence it is useful to have an alternative method for finding confidence intervals. One such general method is the method of bootstrap confidence intervals.


Bootstrap, also called empirical bootstrap, is a useful technique which relies on resampling from the observed data $x_1,\ldots,x_n$ in order to empirically construct the distribution of the point estimator. One way in which this resampling can be conducted is to apply the inverse probability transform on the empirical cumulative distribution function. However, from an implementation perspective a simpler alternative is to consider the data points $x_1,\ldots,x_n$, and then randomly sample n discrete uniform indexes, $j_1,\ldots,j_n$, each in the range $\{1,\ldots,n\}$. The resampled data, denoted via $x^* = (x_1^*,\ldots,x_n^*)$, is then,

$$x_1^* = x_{j_1},\ \ldots,\ x_n^* = x_{j_n}.$$


That is, each point in the resampled data is a random observation from the original data, where we 'sample with replacement'. In Julia, if the sample is represented by an array called sampleData, say of length n, we create an instance of $x^*$ by executing rand(sampleData,n). This method of the rand() function will uniformly sample n random copies of elements from sampleData with replacement.


The idea of empirical bootstrap is now to repeat the resampling a large number of times, say N, and for each resampled data vector, $x^*(1),\ldots,x^*(N)$, to compute the parameter estimate. If the parameter estimate is denoted by the function $h : \mathbb{R}^n \to \mathbb{R}$, then we end up with values,

$$h^*(1) = h\big(x_1^*(1),\ldots,x_n^*(1)\big),$$
$$h^*(2) = h\big(x_1^*(2),\ldots,x_n^*(2)\big),$$
$$\vdots$$
$$h^*(N) = h\big(x_1^*(N),\ldots,x_n^*(N)\big).$$

A bootstrap confidence interval is then determined by computing the respective lower and upper $(\alpha/2,\ 1-\alpha/2)$ quantiles of the sequence $h^*(1),\ldots,h^*(N)$. The beauty of this method is that if n is not too small, the resulting quantiles are quite close to the actual quantiles of the distribution of the point estimate. Hence in general, bootstrap confidence intervals provide a very generic method to obtain confidence intervals for parameters in question.
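
The steps above can be collected into a small reusable function. The following is a minimal sketch under our own naming (bootstrapCI is hypothetical) that works for any statistic h passed as a function. The data here is synthetic and merely stands in for real observations.

```julia
using Random, Statistics
Random.seed!(0)

# Bootstrap confidence interval for a statistic h (e.g. mean or median),
# based on the alpha/2 and 1-alpha/2 quantiles of the resampled estimates.
function bootstrapCI(data, h; N = 10^4, alpha = 0.1)
    n = length(data)
    hStar = [h(rand(data, n)) for _ in 1:N]    # h*(1),...,h*(N)
    (quantile(hStar, alpha/2), quantile(hStar, 1 - alpha/2))
end

sampleData = 50 .+ 2*randn(25)     # synthetic data, for illustration only
ciMean = bootstrapCI(sampleData, mean)
ciMedian = bootstrapCI(sampleData, median)
println("Bootstrap confidence interval for the mean:   ", ciMean)
println("Bootstrap confidence interval for the median: ", ciMedian)
```

Passing the statistic as an argument reflects the generality of the method: nothing in the procedure depends on the form of h.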


In Listing 6.13 we generate a bootstrap confidence interval for the mean, using the same data that was used in Listing 6.2. The listing also creates Figure 6.5, which illustrates the empirical distribution of $h^*(1),\ldots,h^*(N)$ where $h(\cdot)$ is the sample mean. The 90% confidence interval is (52.53, 53.38). Compare this with the output of Listing 6.2, where the 90% confidence interval using the T-distribution and the formula for the population mean was (52.5, 53.41). Clearly, the results are similar. While bootstrap requires more computational effort and doesn't come with a neat simple formula such as (6.5), it is useful because we can use it to generate confidence intervals for other point estimators. For example, as observed in the output, we also generate a 90% confidence interval for the median.


Listing 6.13: Bootstrap confidence interval

using Random, CSV, Distributions, Plots; pyplot()
Random.seed!(0)

sampleData = CSV.read("../data/machine1.csv", header=false)[:,1]
n, N = length(sampleData), 10^6
alpha = 0.1

bootstrapSampleMeans = [mean(rand(sampleData, n)) for i in 1:N]
Lmean = quantile(bootstrapSampleMeans, alpha/2)
Umean = quantile(bootstrapSampleMeans, 1-alpha/2)

bootstrapSampleMedians = [median(rand(sampleData, n)) for i in 1:N]
Lmed = quantile(bootstrapSampleMedians, alpha/2)
Umed = quantile(bootstrapSampleMedians, 1-alpha/2)

println("Bootstrap confidence interval for the mean: ", (Lmean, Umean) )
println("Bootstrap confidence interval for the median: ", (Lmed, Umed) )

stephist(bootstrapSampleMeans, bins=1000, c=:blue,
    normed=true, label="Sample \nmeans")
plot!([Lmean, Lmean],[0,2], c=:black, ls=:dash, label="90% CI")
plot!([Umean, Umean],[0,2],c=:black, ls=:dash, label="",
    xlims=(52,54), ylims=(0,2), xlabel="Sample Means", ylabel="Density")








Bootstrap confidence interval for the mean: (52.530497748, 53.376643266) 
Bootstrap confidence interval for the median: (52.373195891, 53.49007500) 





In line 4 we load our sample observations, and store them in the array sampleData. In line 5, the total number of sample observations is assigned to n and the number of repetitions of the bootstrap, N, is specified. In line 6 we specify the level of our confidence interval, alpha. In line 8 we generate N bootstrapped sample means which are assigned to the array bootstrapSampleMeans. In lines 9-10 the lower and upper quantiles of our bootstrapped sample means are calculated, and stored as the variables Lmean and Umean respectively. Then lines 12-14 repeat the process for the sample median using median(). The resulting confidence intervals are printed in lines 16-17. Lines 19-23 plot a histogram of bootstrapped sample means with an illustration of the resulting confidence interval.

One may ask, how accurate are bootstrap confidence intervals? We now carry out a computational experiment and see that if the number of sample observations is not very large, then the coverage probability of a bootstrapped confidence interval is only approximately $1-\alpha$, but not exactly. However, as the sample size n grows the coverage probability converges to the desired $1-\alpha$.


In Listing 6.14 we create a series of confidence intervals based on different numbers of sample observations from an exponential distribution with $\lambda = 0.1$. Our confidence intervals are for the median, which theoretically equals $\log(2)/\lambda \approx 6.931$. By increasing the sample size n and repeating many sampling scenarios, $M = 10^3$, each with a bootstrap computation using $N = 10^4$, we estimate the coverage probability. We see that as n increases the coverage probability approaches $1-\alpha$.


Listing 6.14: Coverage probability for bootstrap confidence intervals

using Random, Distributions
Random.seed!(0)

lambda = 0.1
dist = Exponential(1/lambda)
actualMedian = median(dist)

M = 10^3
N = 10^4
nRange = 2:2:10
alpha = 0.05

for n in nRange
    coverageCount = 0
    for _ in 1:M
        sampleData = rand(dist, n)
        bootstrapSampleMeans = [median(rand(sampleData, n)) for _ in 1:N]
        L = quantile(bootstrapSampleMeans, alpha/2)
        U = quantile(bootstrapSampleMeans, 1-alpha/2)
        coverageCount += L < actualMedian && actualMedian < U
    end
    println("n = ",n,"\t coverage = ", coverageCount/M)
end





n = 2    coverage = 0.483
n = 4    coverage = 0.881
n = 6    coverage = 0.936
n = 8    coverage = 0.939
n = 10   coverage = 0.949





In line 4 we specify $\lambda$. In line 5 we create dist, remembering that in Julia the exponential distribution is parameterized by the inverse of $\lambda$. The actual median is then computed in line 6 via the median() method implemented in the Distributions package. In line 8 we specify the number of repetitions we make to evaluate the coverage probability, M. Then in line 9 the number of bootstrap repetitions, N, is specified. In line 10 we specify the range of number of observations to consider, nRange. We then loop over this range in lines 13-23, where in each iteration we update coverageCount, counting the number of times the bootstrap confidence interval contained actualMedian. Each actual bootstrap confidence interval procedure is in lines 17-19. Note the boolean value to the right of += in line 20, which evaluates to true if the median is covered by (L,U).

Figure 6.6: As the number of observations increases, the width of the prediction interval decreases to a constant.


6.6 Prediction Intervals 


We now look at the concept of a prediction interval, which is somewhat related to confidence intervals but has a different meaning. A prediction interval tells us a predicted range that a single next observation of data is expected to fall within. This differs from a confidence interval, which indicates how confident we are of a particular parameter that we are trying to estimate. For a given distributional model, the bounds of a prediction interval are always wider than those of a confidence interval, as the prediction interval must account for the uncertainty in knowing the population mean, as well as the spread of the data due to variance.


The example that we use is for a sequence of data points $x_1, x_2, x_3, \ldots$, which come from a normal distribution and are assumed i.i.d. Further assume that we observed $x_1,\ldots,x_n$, but have not yet observed $X_{n+1}$. Note that we use 'little' x for values observed and 'upper case' X for (yet) unobserved random variables.


In this case, a $100(1-\alpha)\%$ prediction interval for the single future observation, $X_{n+1}$, is given by,

$$\bar{x} - t_{1-\alpha/2,\,n-1}\, s\sqrt{1+\frac{1}{n}} \ \le\ X_{n+1} \ \le\ \bar{x} + t_{1-\alpha/2,\,n-1}\, s\sqrt{1+\frac{1}{n}}, \qquad (6.31)$$

where $\bar{x}$ and $s$ are respectively the sample mean and sample standard deviation computed from $x_1,\ldots,x_n$. Note that as the number of observations, n, increases, the prediction interval narrows towards,

$$\bar{x} - z_{1-\alpha/2}\, s \ \le\ X_{n+1} \ \le\ \bar{x} + z_{1-\alpha/2}\, s. \qquad (6.32)$$


In Listing 6.15 we illustrate the use of prediction intervals based on a series of observations made from a normal distribution. We start with n = 2 observations and calculate the corresponding prediction interval for the next observation. The sample size n is then progressively increased, and the prediction interval for each next observation is calculated for each subsequent case. The listing creates Figure 6.6, which illustrates that as the number of observations increases, the prediction interval width decreases. Ultimately it follows (6.32) and has an expected width close to $2\,z_{1-\alpha/2}\,\sigma$.

Listing 6.15: Prediction interval with unknown population mean and variance

using Random, Statistics, Distributions, Plots; pyplot()
Random.seed!(0)

mu, sig = 50, 5
dist = Normal(mu, sig)
alpha = 0.01
nMax = 40

observations = rand(dist,1)
piLarray, piUarray = [], []

for _ in 2:nMax
    xNew = rand(dist)
    push!(observations, xNew)

    xbar, sd = mean(observations), std(observations)
    n = length(observations)
    tVal = quantile(TDist(n-1),1-alpha/2)
    delta = tVal * sd * sqrt(1+1/n)
    piL, piU = xbar - delta, xbar + delta

    push!(piLarray,piL); push!(piUarray,piU)
end

scatter(1:nMax, observations,
    c=:blue, msw=0, label="Observations")
plot!(2:nMax, piUarray,
    c=:red, shape=:xcross, msw=0, label="Prediction Interval")
plot!(2:nMax, piLarray,
    c=:red, shape=:xcross, msw=0, label="",
    ylims=(0,100), xlabel="Number of observations", ylabel="Value")











In lines 4-7 we set up the distributional parameters, choose $\alpha$, and also set the limiting number of observations we will make, nMax. In line 9 we sample the first sample observation and store it in the array observations. In line 10, we create the arrays piLarray and piUarray, which will be used to store the lower and upper bounds of the prediction interval. Lines 12-23 contain the main logic of this example, where in lines 13-14 a new data point is obtained and stored, in lines 16-20 the updated prediction interval is calculated, and in line 22 the prediction interval is stored for plotting afterwards. Lines 25-31 create Figure 6.6. Observe the use of shape=:xcross for setting the marker shape.
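
The meaning of (6.31) can also be checked by simulation: over many repetitions, the future observation $X_{n+1}$ should fall within the prediction interval with probability $1-\alpha$. The sketch below is our own illustration with assumed values (n = 10, $\alpha = 0.05$). The value 2.262157 is hard-coded as the 0.975 quantile of the T-distribution with 9 degrees of freedom so that only the standard library is required.

```julia
using Random, Statistics
Random.seed!(0)

mu, sig, n, M = 50, 5, 10, 10^5
tVal = 2.262157      # 0.975 quantile of the T-distribution with n-1 = 9 d.f.

# One experiment: observe x_1,...,x_n, then check if X_{n+1} lands in (6.31)
function covered()
    x = mu .+ sig*randn(n)
    xNext = mu + sig*randn()
    delta = tVal*std(x)*sqrt(1 + 1/n)
    mean(x) - delta <= xNext <= mean(x) + delta
end

coverage = sum(covered() for _ in 1:M)/M
println("Estimated coverage of the prediction interval: ", coverage)
```

The estimate should be close to 0.95, even though n is small, because (6.31) is exact for normally distributed data.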

Figure 6.7: The posterior distribution and three forms of 90% credible intervals.


6.7 Credible Intervals 


This section presents two related concepts, credible intervals and intervals on asymmetric distributions. The concept of credible intervals comes from the field of Bayesian statistics, which we overviewed in Section 5.7 of Chapter 5. It is the analog of a confidence interval in the Bayesian setting. Before we explain this concept further, we first deal with various ways of finding intervals on asymmetric distributions.


In general, we often need to find an interval $[\ell, u]$ such that given some probability density $f(x)$, the interval satisfies,

$$\int_{\ell}^{u} f(x)\,dx = 1-\alpha. \qquad (6.33)$$


This is needed for confidence intervals, which were the focus for most of this chapter, for prediction intervals, which were discussed in the previous section, or for credible intervals, which are discussed below. However, as long as $\alpha < 1$ there is never a single unique interval $[\ell, u]$ that satisfies (6.33).


In certain cases there is a "natural" interval. For example, for the normal distribution, using equal tail quantiles is natural. We do this by choosing $\ell = z_{\alpha/2}$ and $u = z_{1-\alpha/2}$, as was used throughout this chapter. Similarly, for a T-distribution with $n-1$ degrees of freedom we used $\ell = t_{\alpha/2,\,n-1}$ and $u = t_{1-\alpha/2,\,n-1}$. Such choices for $[\ell, u]$ are natural because in both the case of a normal and the T-distribution, the density $f(x)$ is symmetric about the mean. In such cases, the mean is also the median, and further, since the density is unimodal (has a single maximum), the mean and median are also the mode. With such symmetry and unimodality, while there exist other choices of $\ell$ and $u$, they are rarely of interest in applications.


However, consider asymmetric distributions such as the one presented in Figure 6.7. In this case, there isn't an immediate "natural" choice for $\ell$ and $u$. Without going into the actual meaning of the density in the figure just yet, you can already observe three different plotted intervals. They all satisfy (6.33): a blue classic CI, a red equal tail CI, and a green highest density CI. We now describe each of these intervals.


Classic interval - This type of interval has the mode of the density (assuming the density is unimodal) at its center between $\ell$ and $u$. An alternative is to use the mean or median at the center. That is, assuming the centrality measure (mode, median or mean) is $m$, we have $[\ell, u] = [m - E, m + E]$. One way to define E is via

$$E = \max\left\{\epsilon \ge 0 \ :\ \int_{m-\epsilon}^{m+\epsilon} f(x)\,dx \le 1-\alpha\right\}. \qquad (6.34)$$

That is, we can search for the highest $\epsilon$ such that the integral over $[m-\epsilon, m+\epsilon]$ doesn't exceed $1-\alpha$. This is crudely implemented in Listing 6.16 in the function classicalCI().


Equal tail interval - This type of interval simply sets $\ell$ and $u$ as the $\alpha/2$ and $1-\alpha/2$ quantiles respectively. Namely,

$$\frac{\alpha}{2} = \int_{-\infty}^{\ell} f(x)\,dx, \qquad \text{and} \qquad \frac{\alpha}{2} = \int_{u}^{\infty} f(x)\,dx. \qquad (6.35)$$

Such an interval was implicitly used with the asymmetric chi-squared distribution in Section 6.4. See for example formula (6.27).


Highest density interval: This type of interval seeks to cover the part of the support that is
most probable. Define the smallest value of the density over an interval [ℓ, u] via
M(ℓ, u) = min{ f(x) : x ∈ [ℓ, u] }. Then the highest density interval is an interval
[ℓ, u] that satisfies (6.33) while maximizing M(ℓ, u). A consequence is that if the density is
unimodal then this highest density interval is also the narrowest possible confidence interval.

There are multiple computational ways to find such a confidence interval. In Listing 6.16 the
function highestDensityCI() crudely does so by starting with a high density value and
decreasing it gradually while seeking the associated interval [ℓ, u]. An alternative would be
to gradually increment ℓ, each time finding a corresponding u that satisfies (6.33), and within
this search to choose the interval that minimizes the width u − ℓ.
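The alternative search just described can be sketched as follows. This is a minimal sketch, not the implementation used in the listing: the helper name narrowestInterval and the Gamma distribution with shape 37 and scale 1/18 are assumptions chosen for illustration. Instead of incrementing ℓ directly, it scans the lower tail probability p over (0, α), since each p determines both ℓ and the u that gives coverage 1 − α.

```julia
using Distributions

# Hypothetical helper (not part of the listing): for each lower tail
# probability p in (0, alpha), the pair (quantile(p), quantile(p + 1 - alpha))
# has coverage 1 - alpha. We keep the pair with the smallest width.
function narrowestInterval(dist, alpha; steps=10_000)
    bestL, bestU = quantile(dist, alpha/2), quantile(dist, 1 - alpha/2)
    for k in 1:steps
        p = alpha * k / (steps + 1)          # strictly between 0 and alpha
        l, u = quantile(dist, p), quantile(dist, p + 1 - alpha)
        if u - l < bestU - bestL
            bestL, bestU = l, u
        end
    end
    return bestL, bestU
end

post = Gamma(37, 1/18)       # illustrative right-skewed density (an assumption)
l, u = narrowestInterval(post, 0.1)
println("Interval: ", (l, u), "  Width: ", u - l,
        "  Coverage: ", cdf(post, u) - cdf(post, l))
```

For a unimodal density the interval found this way approximates the highest density interval, up to the resolution of the scan over p.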


For a symmetric and unimodal distribution such as the normal distribution or T-distribution,
all three of these confidence intervals agree. In general, however, they don't. In a Bayesian context,
as we describe below, one often prefers the highest density interval. In other settings, equal
tail intervals are common. For example, when considering confidence intervals for the variance, we
used equal tail intervals in (6.27) because this yielded a simple confidence interval formula
which uses tabulated quantiles (of chi-squared distributions).


We now explain credible intervals. These replace confidence intervals in a Bayesian
setting. Recall that in the Bayesian setting, we treat the unknown parameter, θ, as a random
variable. As described in Section 5.7, the process of inference is based on observing data, x =
(x_1, ..., x_n), and fusing it with the prior distribution f(θ) to obtain the posterior distribution f(θ | x).
Here too, as in the frequentist case, we may wish to describe an interval where it is likely that our
parameter lies. Then for a fixed confidence level, 1 − α, we seek [ℓ, u] such that

$$\int_{\ell}^{u} f(\theta \mid x)\,d\theta = 1 - \alpha.$$

Compare this with the confidence interval of Chapter 5, where [L, U] denotes the confidence interval. In the
Bayesian case, ℓ and u are deterministic values determined from the posterior distribution of the random
θ, whereas in the frequentist case, θ is deterministic with L and U random. This is a conceptual
difference between confidence intervals and credible intervals. In practice, however, there isn't a
difference.


Nevertheless, in a Bayesian context, unless dealing with special cases of conjugacy (see Listing 5.19),
the posterior distribution of the parameter is often only available numerically via MCMC
or similar methods (see Listing 5.20). Hence there is no general motivation to use equal tail intervals
or classical intervals. Instead, highest density intervals are often a prime choice.
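When the posterior is only available as a collection of samples, a highest density interval can be approximated directly from the sorted samples. The following is a minimal sketch under stated assumptions: the helper sampleHDI is hypothetical, and the samples are drawn from a right-skewed lognormal as a stand-in for MCMC output, not from any posterior in this book.

```julia
using Random, Statistics

# Hypothetical helper: the shortest interval that contains a fraction
# `level` of the samples, found by sliding a window over the sorted samples.
function sampleHDI(samples, level)
    sorted = sort(samples)
    n = length(sorted)
    k = ceil(Int, level * n)                 # number of samples inside the window
    widths = [sorted[i + k - 1] - sorted[i] for i in 1:(n - k + 1)]
    i = argmin(widths)
    return sorted[i], sorted[i + k - 1]
end

Random.seed!(0)
samples = exp.(0.5 .* randn(10^5))           # stand-in for MCMC output (assumption)

lH, uH = sampleHDI(samples, 0.9)
lE, uE = quantile(samples, 0.05), quantile(samples, 0.95)
println("Highest density: ", (lH, uH), "  Width: ", uH - lH)
println("Equal tail:      ", (lE, uE), "  Width: ", uE - lE)
```

For a skewed sample such as this one, the sliding window interval is no wider than the equal tail interval, which is the practical appeal of the highest density choice.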


In Listing 6.16 we generate Figure 6.7 and compute alternative credible intervals using classical,
equal tail, and highest density methods. We deal with the same small data set that was used in
Section 5.7 of Chapter 5. Here the unknown parameter is λ, the mean of a Poisson distribution.
In this simple example we use gamma-Poisson conjugacy and thus update the hyper-parameters
with simple rules as in (5.24) of Chapter 5.


Observe that by design all three intervals have the same 1 − α = 0.9 coverage
probability. Slight differences only appear due to numerical inaccuracy. Also note that the width of
the highest density interval is the lowest, at 1.102. Indeed, there is no narrower 90% credible interval
for this posterior distribution. We finally note that our implementation aims to be simple and readable but
is neither the most time-efficient nor the most numerically precise.

Listing 6.16: Credible intervals on a posterior distribution


using Distributions, Plots, LaTeXStrings; pyplot()

alpha, beta = 8, 2                              # prior hyper-parameters
data = [2,1,0,0,1,0,2,2,5,2,4,0,3,2,5,0]

newAlpha, newBeta = alpha + sum(data), beta + length(data)
post = Gamma(newAlpha, 1/newBeta)

xGrid = quantile(post,0.01):0.001:quantile(post,0.99)
significance = 0.9; halfAlpha = (1-significance)/2

coverage(l,u) = cdf(post,u) - cdf(post,l)

function classicalCI(dist)
    l, u = mode(dist), mode(dist)
    bestl, bestu = l, u
    while coverage(l,u) < significance
        l -= 0.00001; u += 0.00001              # expand symmetrically about the mode
    end
    (l,u)
end
equalTailCI(dist) = (quantile(dist,halfAlpha), quantile(dist,1-halfAlpha))
function highestDensityCI(dist)
    height = 0.999*maximum(pdf.(dist,xGrid))
    l,u = mode(dist),mode(dist)
    while coverage(l,u) <= significance
        range = filter(theta -> pdf(dist,theta) > height, xGrid)
        l,u = minimum(range), maximum(range)
        height -= 0.00001                       # lower the density threshold
    end
    (l,u)
end

l1, u1 = classicalCI(post)
l2, u2 = equalTailCI(post)
l3, u3 = highestDensityCI(post)
println("Classical: ", (l1,u1), "\tWidth: ", u1-l1,
        "\tCoverage: ", coverage(l1,u1))
println("Equal tails: ", (l2,u2), "\tWidth: ", u2-l2,
        "\tCoverage: ", coverage(l2,u2))
println("Highest density: ", (l3,u3), "\tWidth: ", u3-l3,
        "\tCoverage: ", coverage(l3,u3))

plot(xGrid, pdf.(post,xGrid), yticks=(0:0.25:1.25),
    c=:black, label="Gamma Posterior Distribution",
    xlims=(1.1, 2.9), ylims=(-0.4, 1.25))
plot!([l1,u1], [-0.1,-0.1], label="Classic CI",
    c=:blue, shape=:vline, ms=16)
plot!([l2,u2], [-0.2,-0.2], label="Equal Tail CI",
    c=:red, shape=:vline, ms=16)
plot!([l3,u3], [-0.3,-0.3], label="Highest Density CI",
    c=:green, shape=:vline, ms=16, xlabel=L"\lambda", ylabel="Density")























Classical: (1.44, 2.56) Width: 1.1146 Coverage: 0.90000 
Equal tails: (1.53, 2.64) Width: 1.1081 Coverage: 0.89999 
Highest density: (1.51, 2.60) Width: 1.1020 Coverage: 0.90018 
In line 3 we define the prior hyper-parameters, alpha and beta. Line 4 defines the observations.
In line 6 the posterior hyper-parameters are calculated using the Poisson-gamma conjugacy update
rule. In line 7 we create an object, post, for the posterior distribution. In line 9 we define a fine
grid for carrying out computation. In line 10 we define the confidence level 1 − α via significance
and also use halfAlpha to denote α/2. In line 12 the function coverage() is defined to compute
the coverage probability in the posterior distribution on the interval ranging between l and u. Lines
14-32 define the three main functions of this code block, classicalCI(), equalTailCI(), and
highestDensityCI(), each returning the respective confidence/credible interval: classical, equal
tail, and highest density. The implementation of classicalCI() in lines 14-21 starts with (l,u)
at the mode of the distribution and then expands outwards in small steps until the desired coverage is
reached. The implementation of equalTailCI() in line 22 simply returns the α/2 and 1 − α/2 quantiles.
The implementation of highestDensityCI() in lines 23-32 works by starting with height
near the maximal possible value and decreasing it until coverage is satisfied. In this implementation, in each
iteration we use filter() in line 27 with the anonymous function theta -> pdf(dist,theta)
> height. This sets range to be the array of all grid values of theta for which the density exceeds
height. In lines 34-36, we use our confidence/credible interval functions to obtain credible intervals
for the posterior distribution. The intervals, their widths and coverage probabilities are printed in
lines 37-42. Then lines 44-52 generate Figure 6.7. Note the use of :vline with a specification of ms
to create the confidence intervals in the figure.





Chapter 7 


Hypothesis Testing - DRAFT 


In this chapter we explore hypothesis testing through a few specific practical hypothesis tests.
Recall the general hypothesis test formulation first introduced in Section 5.6, where we partition
the parameter space Θ as follows,

$$H_0 : \theta \in \Theta_0, \qquad H_1 : \theta \in \Theta_1.$$

One of the most common cases for a single population is to consider θ as μ, the population mean,
in which case Θ = ℝ. Often, we wish to test if the population mean is equal to some value, μ_0,
hence we can construct a two sided hypothesis test as follows,

$$H_0 : \mu = \mu_0, \qquad H_1 : \mu \neq \mu_0. \tag{7.1}$$

However, one could instead choose to construct a one sided hypothesis test, as,

$$H_0 : \mu \le \mu_0, \qquad H_1 : \mu > \mu_0, \tag{7.2}$$

or alternatively, in the opposite direction,

$$H_0 : \mu \ge \mu_0, \qquad H_1 : \mu < \mu_0, \tag{7.3}$$

where the choice of setting up (7.1), (7.2) or (7.3) depends on the context of the problem.


As covered in Section 5.6, once the hypothesis is established, the general approach involves
calculating the test statistic, along with the corresponding p-value, and then finally making some
statement about the null hypothesis based on some chosen level of significance. In this chapter we
present some specific common examples often used in practice.


In Section 7.1 we introduce hypothesis testing via several examples involving a single population.
In Section 7.2 we present extensions of these concepts and related ideas by looking at inference
for the difference in means of two populations. In Section 7.3 we focus on methods of Analysis
of Variance (ANOVA). Then in Section 7.4 we explore Chi-squared tests and Kolmogorov-Smirnov
tests. These latter two procedures are often used to assess goodness of fit, independence, or both.
We then close with Section 7.5 where we illustrate how power curves can aid in experimental design.


As in the previous chapters, we try to strike a balance between use of HypothesisTests and 
reproducing results from fundamental calculations, with the purpose of highlighting key phenomena. 
Several of the examples make use of the datasets machine1.csv and machine2.csv, which
represent the diameter (in mm) of pipes produced via two separate machines.
[Figure: density of Z ~ N(0,1) with rejection boundaries at −1.96 and 1.96]
Figure 7.1: The standard normal distribution and rejection regions for the
two sided hypothesis test (7.4) at significance level α = 5%.


7.1 Single Sample Hypothesis Tests for the Mean 


As an introduction, consider the case where we wish to make inference on the mean diameter of
pipes produced by a machine. Specifically, assume that we are interested in checking if the machine
is producing pipes of the specified diameter μ_0 = 52.2 mm. In this case, using a hypothesis testing
methodology, we may wish to set up the hypothesis as,

$$H_0 : \mu = 52.2, \qquad H_1 : \mu \neq 52.2. \tag{7.4}$$

Here, H_0 represents the situation where the machine is functioning properly, and deviation from H_0
in either the positive or negative direction is captured by H_1, which represents that the machine is
malfunctioning. Alternatively, one could have treated μ_0 = 52.2 as a specified upper limit on the
pipe diameter, in which case the hypothesis would be formulated as,

$$H_0 : \mu \le 52.2, \qquad H_1 : \mu > 52.2. \tag{7.5}$$

Similarly, one could envision a case where (7.3) was used instead. In most of this chapter, we do not
dive deeply into the aspects of formulating the hypotheses themselves but rather the hypotheses
are introduced and treated as given. If you are interested in the experimental design aspect of
hypothesis testing, you may consult texts dedicated to that topic.


Once the hypothesis is formulated, the next step is the collection of data, which in this section
is taken from machine1.csv. We now separate the inference of the mean of a single population
into the two cases of variance known and unknown, similarly to what was done for confidence
intervals in Chapter 6. Note also that, as in Chapter 6, it is assumed that the observations
X_1, ..., X_n are normally distributed. Finally, at the end of this section, we consider a simple
non-parametric test, where we make no assumptions about the distribution of the observations.


Population Variance Known 


Consider the case where we wish to test whether a single machine in a factory is producing pipes
of a specified diameter. For this example, we set up the hypothesis as two sided according to (7.4)
and assume that σ is a known value. Recall from Section 5.2 that, under H_0, the sample mean X̄ follows a normal
distribution with mean μ_0 and variance σ²/n. Hence it holds that under H_0 the Z-statistic,

$$Z = \frac{\overline{X} - \mu_0}{\sigma/\sqrt{n}}, \tag{7.6}$$

follows a standard normal distribution.

[Figure: density of Z ~ N(0,1) with rejection boundary at 1.64]
Figure 7.2: The standard normal distribution and rejection region for the one
sided hypothesis test (7.5) at significance level α = 5%.


As we will see through the various examples in this chapter, the test statistic often follows a
general form similar to that of (7.6). In this case, under the null hypothesis, the random variable Z
is normally distributed, and hence to carry out a hypothesis test we observe its position relative to a
standard normal distribution. Specifically, we check if it lies within the rejection region or not, and
if it does, we reject the null hypothesis, otherwise we don't. In Figure 7.1 we present the rejection
region corresponding to α = 5%. It is obtained by considering the α/2 and 1 − α/2 quantiles of the
standard normal distribution.


With the hypothesis test and rejection region specified, we are ready to collect data, calculate
the test statistic, and make a conclusion. For this example, the data is taken from machine1.csv
where we assume σ = 1.2 (i.e. σ is known). After collecting the data, the observed Z-statistic is
calculated via,

$$z = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}}. \tag{7.7}$$

Since this is a two sided test, we aim for a symmetric rejection region. We then reject H_0 if
|z| > z_{1−α/2}, where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution for a specified
significance level α. We may also compute the p-value of the test via,

$$p = 2\,P(Z > |z|). \tag{7.8}$$


That is, we consider the observed test statistic, z, and determine the smallest significance level
at which we would reject H_0. Hence, for a fixed significance level α, if p < α, we reject H_0 and
otherwise not. The calculation of critical values such as z_{1−α/2} and p-values is typically done via
software, or more traditionally via statistical tables, which list the area under the normal curve along
with different quantiles of a standard normal. For example z_{0.025} = −1.96, or z_{0.975} = 1.96. For
α = 0.05, we reject the null hypothesis if z > 1.96 or z < −1.96, otherwise we don't reject.


If the null hypothesis is rejected then we conclude the test by stating, “there is sufficient evidence
to reject the null hypothesis at the 5% significance level”. Otherwise we conclude by stating, “there is
insufficient evidence to reject the null hypothesis at the 5% significance level”. Note that if a different
hypothesis test setup was used, such as (7.5), then the rejection region would not be symmetric
as in Figure 7.1, but rather would cover only one tail of the distribution. This is illustrated in
Figure 7.2 where z_{0.95} = 1.645 is used to determine the boundary of the rejection region. In such
a case, the p-value is calculated via p = P(Z > z).
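The critical values and p-values described above can be obtained with the Distributions package rather than from tables. The following is a minimal sketch: the variable names are illustrative, and the observed statistic z is simply the value that appears in the output of Listing 7.1.

```julia
using Distributions

alpha = 0.05
z = 2.8182138203055467                        # observed statistic (output of Listing 7.1)

zTwoSided = quantile(Normal(), 1 - alpha/2)   # boundary for the two sided test
zOneSided = quantile(Normal(), 1 - alpha)     # boundary for the one sided test

pTwoSided = 2 * ccdf(Normal(), abs(z))        # p-value as in (7.8)
pOneSided = ccdf(Normal(), z)                 # p = P(Z > z) for a setup such as (7.5)

println("Two sided: critical value = ", zTwoSided,
        ", reject = ", abs(z) > zTwoSided, ", p-value = ", pTwoSided)
println("One sided: critical value = ", zOneSided,
        ", reject = ", z > zOneSided, ", p-value = ", pOneSided)
```

Comparing |z| with the critical value and comparing the p-value with α are two views of the same decision, so both printed rejection flags agree with the corresponding p-value checks.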


In Listing 7.1 we present an example containing two hypothesis tests (using the same data)
where the first (mu0A) is rejected and the second (mu0B) is not rejected. For the mu0A case, the
test statistic is first calculated via (7.7) along with the corresponding p-value via (7.8). Then the
HypothesisTests package is used to perform the same hypothesis test for both mu0A and mu0B.
The default test assumes α = 5%. Observe that the p-value in the mu0A case is less than 0.05 and
hence H_0 is rejected. In comparison, for the mu0B case, the p-value is greater than 0.05 and hence
H_0 is not rejected.


Listing 7.1: Inference with single sample, population variance is known

using CSV, Distributions, HypothesisTests

data = CSV.read("../data/machine1.csv", header=false)[:,1]
xBar, n = mean(data), length(data)
sigma = 1.2
mu0A, mu0B = 52.2, 53

testStatistic = ( xBar - mu0A ) / ( sigma/sqrt(n) )
pVal = 2*ccdf(Normal(), abs(testStatistic))

testA = OneSampleZTest(xBar, sigma, n, mu0A)
testB = OneSampleZTest(xBar, sigma, n, mu0B)

println("Results for mu0 = ", mu0A, ":")
println("Manually calculated test statistic: ", testStatistic)
println("Manually calculated p-value: ", pVal, "\n")
println(testA)

println("\n In case of mu0 = ", mu0B, " then p-value = ", pvalue(testB))





Results for mu0 = 52.2:
Manually calculated test statistic: 2.8182138203055467
Manually calculated p-value: 0.004829163880878602

One sample z-test

Population details:
    parameter of interest:   Mean
    value under h_0:         52.2
    point estimate:          52.95620612127991
    95% confidence interval: (52.4303, 53.4821)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0048

Details:
    number of observations:    20
    z-statistic:               2.8182138203055467
    population standard error: 0.2683281572999747

 In case of mu0 = 53 then p-value = 0.870352975060586
In lines 3-6 we load the data, calculate the sample mean, and specify the values of mu0A and mu0B
under H_0 (there are two separate tests in this example). Note that importantly, the standard deviation,
sigma, is specified as 1.2, as it is ‘known’. In line 8 we calculate the test statistic for case mu0A
according to (7.7). In line 9 we calculate the p-value according to (7.8). Note that the ccdf()
function is used to find the area to the right of the absolute value of the test statistic. In lines 11
and 12, OneSampleZTest() from HypothesisTests is used to perform the same calculations
for both the mu0A and mu0B case. The results are stored in testA and testB. These objects can
then be printed or queried. Note that OneSampleZTest() was called with 4 arguments. If the last
argument (mu0A or mu0B) was excluded, then the function would have performed the one sample z
test assuming μ_0 = 0. There is also an additional method for OneSampleZTest(), which simply
takes a single argument of an array of values. In this case it will use the sample standard deviation
as the population standard deviation, and will assume μ_0 = 0. Further information is available in the
documentation of the HypothesisTests package. Lines 14-17 print the results for the mu0A case.
The p-value of 0.0048 merits rejection of H_0 for α = 5%. The output from line 17 also lists the value
of the parameter under H_0, the point estimate of the parameter (xBar), as well as the corresponding
95% confidence interval. Line 19 prints the p-value for the second hypothesis test which uses mu0B,
and since the p-value is greater than 0.05, we do not reject H_0. Note the use of the pvalue() function
applied to testB. This way of using the HypothesisTests package is based on creating an object
(testB in this case) and then querying it using a function like pvalue().





Population Variance Unknown 


Having covered the case of variance known, we now consider the more realistic scenario where
the population variance is unknown. Informally called the T-test, this is perhaps the most famous
and widely used hypothesis test in elementary statistics. Here the test statistic is the T-statistic,

$$T = \frac{\overline{X} - \mu_0}{S/\sqrt{n}}. \tag{7.9}$$

Notice that it is similar to (7.6); however, the sample standard deviation, S, is used instead of the
population standard deviation σ, since σ is unknown. As presented in Section 5.2, in the case where
the data is normally distributed with mean μ_0, the random variable T follows a T-distribution with
n − 1 degrees of freedom and this is the basis for the T-test. The procedure is the same as the Z-test
presented above, except that a T-distribution is used instead of a normal distribution. Note that
for non-small n, the T-distribution is almost identical to a standard normal distribution.


The observed test statistic from the data is then,

$$t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}}, \tag{7.10}$$

and the corresponding p-value for a two sided test is

$$p = 2\,P(T_{n-1} > |t|), \tag{7.11}$$

where T_{n−1} is a random variable distributed according to a T-distribution with n − 1 degrees of
freedom. Note that standardized tables present critical values of the T-distribution, namely t_γ where
γ is typically 0.9, 0.95, 0.975, 0.99 and 0.995. These are typically presented in detail for degrees of
freedom ranging from 2 to 30, after which the T-distribution is very similar to a normal
distribution. These values are then compared to the T-statistic (7.10) where γ = 1 − α in the one
sided case or γ = 1 − α/2 in the two sided case. However, for precise calculation of p-values, software
must be used.
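The tabulated critical values mentioned above can be reproduced with quantile() applied to TDist objects. The following is a minimal sketch for γ = 0.975; the names level, tCrit and zCrit and the particular degrees of freedom are chosen here for illustration.

```julia
using Distributions

level = 0.975                            # gamma = 1 - alpha/2 for alpha = 0.05
tCrit(df) = quantile(TDist(df), level)   # critical value of a T-distribution
zCrit = quantile(Normal(), level)        # corresponding normal critical value

for df in [2, 5, 10, 19, 30, 100]
    println("df = ", df, ":  t quantile = ", round(tCrit(df), digits=4))
end
println("Normal:  z quantile = ", round(zCrit, digits=4))
```

The T critical values decrease towards the normal critical value as the degrees of freedom grow, which is why tables typically stop at around 30 degrees of freedom.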


In Listing 7.2 we first calculate the test statistic via (7.10) and then use this to manually calculate
the corresponding p-value via (7.11). Then OneSampleTTest() from the HypothesisTests
package is used to perform the same hypothesis test. The output from our manual calculation matches
that of OneSampleTTest().


Listing 7.2: Inference with single sample, population variance unknown 


using CSV, Distributions, HypothesisTests

data = CSV.read("../data/machine1.csv", header=false)[:,1]
xBar, n = mean(data), length(data)
s = std(data)
mu0 = 52.2

testStatistic = ( xBar - mu0 ) / ( s/sqrt(n) )
pVal = 2*ccdf(TDist(n-1), abs(testStatistic))

println("Manually calculated test statistic: ", testStatistic)
println("Manually calculated p-value: ", pVal, "\n")
OneSampleTTest(data, mu0)





Manually calculated test statistic: 2.86553950269453 
Manually calculated p-value: 0.009899631865162935 


One sample t-test 





Population details: 


parameter of interest: Mean 
value under h_0: 52.2 
point estimate: 52.95620612127991 


95% confidence interval: (52.4039, 53.5085) 


Test summary: 
outcome with 95% confidence: reject h_0 





two-sided p-value: 0.0099 
Details: 

number of observations: 20 

t-statistic: 2.86553950269453 

degrees of freedom: 19 


empirical standard error: 0.2638965962845154 





Lines 1-9 are similar to Listing 7.1. In line 5 the sample standard deviation is calculated and stored
as s. In lines 8 and 9 the test statistic and p-value are calculated according to (7.10) and (7.11)
respectively. Here the ccdf() function is used on a T-distribution with n − 1 degrees of freedom,
TDist(n-1). The manual calculations are output in lines 11 and 12. In line 13 OneSampleTTest()
is used to perform the same hypothesis test on the data. Note that in this case we only specify two
arguments, the array of our data, and the value of μ_0, mu0.





A Non-parametric Sign Test 


The validity of the T-test relies heavily on the assumption that the sample X_1, ..., X_n is comprised
of independent normal random variables. This is because only under this assumption does the
T-statistic follow a T-distribution. This assumption may often be safely made; however, in certain
cases we cannot assume a normal population and we need an alternative test.


Here we present a particular type of non-parametric test known as the sign test. The phrase 
“non-parametric” implies that the distribution of the test statistic does not depend on any particular 
distributional assumption for the population. 


For the sign test, we begin by denoting the random variables,

$$X^{+} = \sum_{i=1}^{n} \mathbf{1}\{X_i > \mu_0\} \qquad \text{and} \qquad X^{-} = \sum_{i=1}^{n} \mathbf{1}\{X_i < \mu_0\} = n - X^{+}, \tag{7.12}$$

where 1{·} is the indicator function. The variable X⁺ is a count of the number of observations that
exceed μ_0, and similarly, X⁻ is a count of the number of observations that are below μ_0.


Observe that under H_0 : μ = μ_0, it holds that P(X_i > μ_0) = P(X_i < μ_0) = 1/2. Note that
here we are actually taking μ_0 as the median of the distribution and assuming that P(X_i = μ_0) = 0,
as is the case for a continuous distribution. Hence under H_0 the random variables X⁺ and X⁻
both follow a binomial(n, 1/2) distribution (see Section 3.5). Given the symmetry of this binomial
distribution we define the test statistic to be,

$$U = \max\{X^{+}, X^{-}\}. \tag{7.13}$$

Hence with observed data, and an observed test statistic u, the p-value can be calculated via,

$$p = 2\,P(B > u), \tag{7.14}$$

where B is a binomial(n, 1/2) random variable. Here, under H_0, p is the probability of getting an
extreme number of signs greater than u (either too many via X⁺ or a very small number via X⁻).
The test procedure is then to reject H_0 if p < α.
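Since B is binomial(n, 1/2), the p-value (7.14) can also be written out directly with binomial coefficients, which makes the formula concrete without any packages. The following is a minimal sketch; the helper names upperTail and signTestPValue are hypothetical.

```julia
# Hypothetical helper: P(B > u) for B ~ binomial(n, 1/2), obtained by summing
# the binomial coefficients for k = u+1, ..., n and dividing by 2^n.
upperTail(n, u) = sum(Int[binomial(n, k) for k in (u+1):n]) / 2.0^n

# The sign test p-value as in (7.14)
signTestPValue(n, u) = 2 * upperTail(n, u)

println("n = 20, u = 15: p-value = ", signTestPValue(20, 15))
println("n = 20, u = 11: p-value = ", signTestPValue(20, 11))
```

For n = 20 with u = 15 and u = 11 the results agree, up to floating point rounding, with the p-values in the output of Listing 7.3.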


In Listing 7.3 we present an example where we calculate the value of the test statistic and its
corresponding p-value manually. We then use these to make conclusions about the null hypothesis
at the 5% significance level. As was done in Listing 7.1 we compare two hypothetical cases. In the
first case μ_0 = 52.2, and in the second μ_0 = 53.0. As can be observed from the output, the former case
is significant (H_0 is rejected) while the latter is not, as the test statistic of 11 is plausible
under H_0.

Listing 7.3: Non-parametric sign test


using CSV, Distributions, HypothesisTests

data = CSV.read("../data/machine1.csv", header=false)[:,1]
n = length(data)
mu0A, mu0B = 52.2, 53

xPositiveA = sum(data .> mu0A)
testStatisticA = max(xPositiveA, n-xPositiveA)

xPositiveB = sum(data .> mu0B)
testStatisticB = max(xPositiveB, n-xPositiveB)

binom = Binomial(n,0.5)
pValA = 2*ccdf(binom, testStatisticA)
pValB = 2*ccdf(binom, testStatisticB)

println("Binomial mean: ", mean(binom))

println("Case A: mu0: ", mu0A)
println("\tTest statistic: ", testStatisticA)
println("\tp-value: ", pValA)

println("Case B: mu0: ", mu0B)
println("\tTest statistic: ", testStatisticB)
println("\tp-value: ", pValB)








Binomial mean: 10.0
Case A: mu0: 52.2
	Test statistic: 15
	p-value: 0.011817932128906257
Case B: mu0: 53
	Test statistic: 11
	p-value: 0.5034446716308596








In line 5 the value of the population mean under the null hypothesis for both cases, mu0A and mu0B,
is specified. In lines 7-11 the observed test statistics for both cases are calculated via (7.13). Note
the use of .> for comparing the array data element wise with the scalars mu0A and mu0B. In lines
13-15, Binomial() and ccdf() are used to compute the p-values for both cases via (7.14).
The results are printed in lines 19-25. As there are n = 20 observations the binomial mean is 10.





Sign Test vs. T-Test 


With the sign test presented as a robust alternative to the T-test, one may ask why not simply
always use the sign test. After all, the validity of the T-test rests on the assumption that X_1, ..., X_n
are normally distributed. Otherwise, T of (7.9) does not follow a T-distribution, and conclusions
drawn from the test may be imprecise.

One answer is due to the statistical power of the test. As we show in the example below, the
T-test is a more powerful test than the sign test when the normality assumption holds. That is,
for a fixed α, the probability of detecting H_1 is higher for the T-test than for the sign test. This
makes it a more effective test to use, if the data can be assumed normally distributed. The concept
of power was first introduced in Section 5.6 and is further investigated in Section 7.5.


In Listing 7.4 we perform a two-sided hypothesis test for H_0 : μ = 53 vs. H_1 : μ ≠ 53 via
both the T-test and sign test. We consider a range of scenarios where we let the actual μ vary over
[51.0, 55.0]. When μ = 53, H_0 is the case, however all other cases fall in H_1. On a grid of such
cases we use Monte Carlo to estimate the power of the tests (for σ = 1.2). The resulting curves in
Figure 7.3 show that the T-test is more powerful than the sign test.


Listing 7.4: Comparison of sign test and T-test

using Random, Distributions, HypothesisTests, Plots; pyplot()

muRange = 51:0.02:55
n = 20
N = 10^4
mu0 = 53.0
powerT, powerU = Float64[], Float64[]

for muActual in muRange

    dist = Normal(muActual, 1.2)
    rejectT, rejectU = 0, 0
    Random.seed!(1)

    for _ in 1:N
        data = rand(dist,n)
        xBar, stdDev = mean(data), std(data)

        tStatT = (xBar - mu0)/(stdDev/sqrt(n))
        pValT = 2*ccdf(TDist(n-1), abs(tStatT))

        xPositive = sum(data .> mu0)
        uStat = max(xPositive, n-xPositive)
        pValSign = 2*ccdf(Binomial(n,0.5), uStat)

        rejectT += pValT < 0.05
        rejectU += pValSign < 0.05
    end

    push!(powerT, rejectT/N)
    push!(powerU, rejectU/N)

end

plot(muRange, powerT, c=:blue, label="t test")
plot!(muRange, powerU, c=:red, label="Sign test",
    xlims=(51,55), ylims=(0,1),
    xlabel="Different values of muActual",
    ylabel="Proportion of times H0 rejected", legend=:bottomleft)
[Figure: power curves; x-axis "Different values of muActual", y-axis "Proportion of times H0 rejected"; legend: t test, Sign test]
Figure 7.3: Power of the T-test vs. power of the sign-test.





In lines 3-6 we set up the basic parameters. The sample size n is 20. The range muRange represents
the range [51.0, 55.0] in discrete steps of 0.02. The number N is the number of simulation repetitions
to carry out for each value of μ in that range. The value mu0 is the value under H_0. In line 7 we
initialize empty arrays that are to be populated with power estimates. Lines 9-33 contain the main
loop where each discrete step in muRange is tested. In line 11, for each value, a distribution dist is
created with the same standard deviation, 1.2, but with a different mean, muActual. The inner loop
of lines 15-28 is a repetition of the sampling experiment N times, where for the same data we compute
tStatT, uStat, and the corresponding p-values. Rejection counts are accumulated in lines 26 and 27,
where pValT < 0.05 and pValSign < 0.05 constitute a rejection. Note that Random.seed!() is
used in line 13 so as to obtain a smoother curve via common random numbers. See Section 10.6. Lines
30-31 record the power for the respective muActual by appending to the arrays using push!().
Lines 35-39 plot the power curves showing the superiority of the T-test. Observe that at μ = 53, where
H_0 holds, the estimated rejection rate of the T-test is approximately α = 0.05.
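Because the sign test statistic U is discrete, its rejection probability when μ = μ_0 need not coincide with the nominal α. It can be computed exactly from binomial probabilities instead of by simulation. The following is a minimal sketch, assuming the rejection rule 2P(B > u) < α as used for pValSign in the listing and a continuous population (no ties); the helper names are hypothetical.

```julia
# Hypothetical helpers: upper tail of binomial(n, 1/2) and the rejection rule
# based on the p-value 2P(B > u), as used for pValSign in the listing.
upperTail(n, u) = sum(Int[binomial(n, k) for k in (u+1):n]) / 2.0^n
rejects(n, u; alpha=0.05) = 2 * upperTail(n, u) < alpha

n = 20
# Under H0, X+ ~ binomial(n, 1/2) and U = max(X+, n - X+). Summing the
# probabilities of all outcomes that lead to rejection gives the exact
# probability of rejecting H0 when it is true.
nullRejection = sum(binomial(n, x) / 2.0^n for x in 0:n if rejects(n, max(x, n - x)))
println("Exact rejection probability of the sign test under H0: ", nullRejection)
```

For n = 20 this evaluates to roughly 0.115, since the rule rejects whenever u ≥ 14. Under these assumptions the sign test curve therefore sits above 0.05 at μ = 53, which is worth keeping in mind when comparing the two curves near the null value.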





7.2 Two Sample Hypothesis Tests for Comparing Means 


Having dealt with several examples involving one population, we now present some common
hypothesis tests for inference on the difference in means of two populations. As with all hypothesis
tests we start by first establishing the testing methodology. Commonly we wish to investigate if
the difference in population means, μ_1 − μ_2, takes on a specific value, Δ_0. Hence we may wish to
set up a two sided hypothesis test as,

$$H_0 : \mu_1 - \mu_2 = \Delta_0, \qquad H_1 : \mu_1 - \mu_2 \neq \Delta_0. \tag{7.15}$$

Alternatively, one could formulate a one sided hypothesis test, such as,

$$H_0 : \mu_1 - \mu_2 \le \Delta_0, \qquad H_1 : \mu_1 - \mu_2 > \Delta_0, \tag{7.16}$$

or the reverse if desired. It is common to consider Δ_0 = 0, in which case (7.15) would be stated
as H_0 : μ_1 = μ_2, and similarly (7.16) as H_0 : μ_1 ≤ μ_2. Once the testing methodology has been
established, the approach then follows the same outline as that covered previously: the test statistic
is calculated along with its corresponding p-value, which is then used to make some conclusion
about the hypothesis for some significance level α.
For the tests introduced in this section we assume that the observations x_1^{(1)}, ..., x_{n_1}^{(1)} from
population 1 and x_1^{(2)}, ..., x_{n_2}^{(2)} from population 2 are all normally distributed, where X^{(j)} has
mean μ_j and variance σ_j². The testing methodology then differs based on the following three cases,

(I) The population variances σ_1² and σ_2² are known.

(II) The population variances σ_1² and σ_2² are unknown and assumed equal.

(III) The population variances σ_1² and σ_2² are unknown, and not assumed equal.

In each of these cases, the test statistic is given by,

$$T = \frac{\overline{X}_1 - \overline{X}_2 - \Delta_0}{S_{err}}, \tag{7.17}$$

where X̄_j is the sample mean of x_1^{(j)}, ..., x_{n_j}^{(j)}, and the standard error S_err varies according to the
case (I-III). In each example, the two datasets machine1.csv and machine2.csv are used, and
it is considered that H_0 implies that both machines are identical (i.e. we use (7.15) with Δ_0 = 0).


Population Variances Known 


In case (I), where the population variances σ_1² and σ_2² are known, we set,

$$S_{err} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$

In this case, S_err is not a random quantity, and hence the test statistic (7.17) follows a standard
normal distribution under H_0. This is due to the distribution of X̄_j as described in Section 5.2. For
this case, the observed test statistic is,

$$z = \frac{\overline{x}_1 - \overline{x}_2 - \Delta_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}. \tag{7.18}$$

At this point, z is used for hypothesis tests in a manner analogous to the single sample test for the
population mean when the variance is known, as described at the start of Section 7.1.


Note that in reality it is highly unlikely that both the population variances would be known, 
hence the HypothesisTests package does not contain functionality for this test. Nevertheless, 
in Listing [7.5] we perform this hypothesis test manually for pedagogical completeness. The output 
shows there is a very significant difference between the machines with the p-value almost zero. 
Listing 7.5: Inference on difference of two means with variances known

using CSV, Distributions, HypothesisTests

data1 = CSV.read("../data/machine1.csv", header=false)[:,1]
data2 = CSV.read("../data/machine2.csv", header=false)[:,1]
xBar1, n1 = mean(data1), length(data1)
xBar2, n2 = mean(data2), length(data2)
sig1, sig2 = 1.2, 1.6
delta0 = 0

testStatistic = ( xBar1-xBar2 - delta0 ) / ( sqrt( sig1^2/n1 + sig2^2/n2 ) )
pVal = 2*ccdf(Normal(), abs(testStatistic))

println("Sample mean machine 1: ",xBar1)
println("Sample mean machine 2: ",xBar2)
println("Manually calculated test statistic: ", testStatistic)
println("Manually calculated p-value: ", pVal)


Sample mean machine 1: 52.95620612127991
Sample mean machine 2: 50.94739671468099
Manually calculated test statistic: 4.340167327618076
Manually calculated p-value: 1.423742605667141e-5





In lines 3-7 we load our data, calculate the sample means, and specify the values of the population
standard deviations, sig1 and sig2 (these are squared in line 10 to obtain the variances). In line 8 we
specify the value of our test parameter under the null hypothesis, delta0, as 0. In line 10 we
calculate the test statistic via (7.18) and in line 11 we calculate the p-value.





Population Variances Unknown and Assumed Equal 


We now consider case (II) where the population variances are unknown, and assumed equal
($\sigma^2 := \sigma_1^2 = \sigma_2^2$). In this case the pooled sample variance, $S_p^2$, is used to
estimate $\sigma^2$ based on both samples. As covered earlier, it is given by,

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}, \qquad (7.19)$$

where $S_i^2$ is the sample variance of sample $i$. It can be shown that under $H_0$, if we set,

$$S_{err} = S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}},$$

the test statistic is distributed according to a T-distribution with $n_1 + n_2 - 2$ degrees of freedom.
For this case the observed test statistic is,

$$t = \frac{\overline{x}_1 - \overline{x}_2 - \Delta_0}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad (7.20)$$

where $s_p$ is the square root of the observed pooled sample variance. At this point the procedure
follows similar lines to the single sample T-test described in the previous section. Note that this two
sample T-test with equal variance is one of the most commonly used tests in statistics.
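As a quick check on (7.19), consider the balanced case $n_1 = n_2 = n$. The short derivation below is a
side note rather than part of the main development. It shows that the pooled variance is then the
plain average of the two sample variances, and that the resulting standard error agrees with the
unpooled form $\sqrt{S_1^2/n_1 + S_2^2/n_2}$ used in case (III):

```latex
S_p^2 = \frac{(n-1)S_1^2 + (n-1)S_2^2}{2n-2} = \frac{S_1^2 + S_2^2}{2},
\qquad
S_{err} = S_p\sqrt{\frac{1}{n} + \frac{1}{n}}
        = \sqrt{\frac{S_1^2 + S_2^2}{n}}
        = \sqrt{\frac{S_1^2}{n} + \frac{S_2^2}{n}}.
```

Hence for equal sample sizes the value of the test statistic is the same in cases (II) and (III), and
only the degrees of freedom of the reference T-distribution differ.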
In Listing 7.6 we present an example where we perform a two sided hypothesis test on the
difference in means of pipes produced from machines 1 and 2. First the test is performed manually,
and then the EqualVarianceTTest() function from the HypothesisTests package is used. The
resulting output shows that the manually calculated values match those given by
EqualVarianceTTest().


Listing 7.6: Inference on difference of means, variances unknown, assumed equal 


using CSV, Distributions, HypothesisTests

data1 = CSV.read("../data/machine1.csv", header=false)[:,1]
data2 = CSV.read("../data/machine2.csv", header=false)[:,1]
xBar1, s1, n1 = mean(data1), std(data1), length(data1)
xBar2, s2, n2 = mean(data2), std(data2), length(data2)
delta0 = 0

sP = sqrt( ( (n1-1)*s1^2 + (n2-1)*s2^2 ) / (n1 + n2 - 2) )
testStatistic = ( xBar1-xBar2 - delta0 ) / ( sP * sqrt( 1/n1 + 1/n2) )
pVal = 2*ccdf(TDist(n1+n2 -2), abs(testStatistic))

println("Manually calculated test statistic: ", testStatistic)
println("Manually calculated p-value: ", pVal, "\n")
println(EqualVarianceTTest(data1, data2, delta0))


Manually calculated test statistic: 4.5466542394674425
Manually calculated p-value: 5.9493058655043084e-5

Two sample t-test (equal variance)

Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.008809406598921
    95% confidence interval: (1.1128, 2.9049)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-4

Details:
    number of observations:   [20,18]
    t-statistic:              4.5466542394674425
    degrees of freedom:       36
    empirical standard error: 0.44182145832893077
Lines 3-7 are similar to Listing 7.5; however, note the calculation of the sample standard deviations
s1 and s2. In line 7 we specify the value of our test parameter under the null hypothesis as 0. Line 9
calculates the square root of the pooled sample variance, sP, via (7.19). Note the use of sqrt(), as
the test statistic (7.20) makes use of $s_p$, not $s_p^2$. Lines 10 and 11 calculate the test statistic
via (7.20) and the corresponding p-value respectively. These are printed as output in lines 13 and 14.
Line 15 uses the EqualVarianceTTest() function to perform the hypothesis test. Note that three
arguments are given: the two arrays data1 and data2, and the value of the test parameter under
$H_0$. The function defaults to $\Delta_0 = 0$ if only two arguments are given, but here we
demonstrate the general use of the function.








Population Variances Unknown, and not Assumed Equal 


In case (III) where the population variances are unknown and not assumed equal
($\sigma_1^2 \neq \sigma_2^2$), we set,

$$S_{err} = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}.$$

Then the observed test statistic is given by

$$t = \frac{\overline{x}_1 - \overline{x}_2 - \Delta_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}. \qquad (7.21)$$

As covered in Section 5.2, the distribution of the test statistic does not follow an exact T-distribution.
Instead, we use the Satterthwaite approximation, and determine the degrees of freedom via,

$$v = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}. \qquad (7.22)$$








In Listing 7.7 we perform a two sided hypothesis test that the difference between the population
means is zero ($\Delta_0 = 0$). We first manually calculate the test statistic and p-value, and then
make use of the UnequalVarianceTTest() function from the HypothesisTests package. The output
shows both methods are in agreement.
Listing 7.7: Inference on difference of means, variances unknown, not assumed equal

using CSV, Distributions, HypothesisTests

data1 = CSV.read("../data/machine1.csv", header=false)[:,1]
data2 = CSV.read("../data/machine2.csv", header=false)[:,1]
xBar1, s1, n1 = mean(data1), std(data1), length(data1)
xBar2, s2, n2 = mean(data2), std(data2), length(data2)
delta0 = 0

v = ( s1^2/n1 + s2^2/n2 )^2 / ( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )
testStatistic = ( xBar1-xBar2 - delta0 ) / sqrt( s1^2/n1 + s2^2/n2 )
pVal = 2*ccdf(TDist(v), abs(testStatistic))

println("Manually calculated degrees of freedom, v: ", v)
println("Manually calculated test statistic: ", testStatistic)
println("Manually calculated p-value: ", pVal, "\n")
println(UnequalVarianceTTest(data1, data2, delta0))


Manually calculated degrees of freedom, v: 31.82453144280283
Manually calculated test statistic: 4.483705005611673
Manually calculated p-value: 8.936189820683007e-5

Two sample t-test (unequal variance)

Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          2.008809406598921
    95% confidence interval: (1.096, 2.9216)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-4

Details:
    number of observations:   [20,18]
    t-statistic:              4.483705005611673
    degrees of freedom:       31.82453144280282
    empirical standard error: 0.4480244360600785





Lines 3-7 are similar to the previous listing. In line 9 the degrees of freedom, v, is calculated via
(7.22). In line 10 the test statistic is calculated via (7.21). In line 11 these are both used to calculate
the p-value. Lines 13-15 output the degrees of freedom, test statistic, and p-value calculated. Line 16
uses the UnequalVarianceTTest() function to perform the hypothesis test.
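To build some intuition for the degrees of freedom formula (7.22), it can be wrapped in a small
function and evaluated on a few inputs. The following is a minimal sketch; the function name
satterthwaiteDof and the input values are hypothetical and are not taken from the machine datasets.

```julia
# Satterthwaite approximation (7.22) for the degrees of freedom.
# s1, s2: sample standard deviations; n1, n2: sample sizes.
function satterthwaiteDof(s1, n1, s2, n2)
    a, b = s1^2/n1, s2^2/n2
    return (a + b)^2 / (a^2/(n1-1) + b^2/(n2-1))
end

# Hypothetical inputs, for illustration only.
vEqual   = satterthwaiteDof(1.0, 20, 1.0, 20)   # equal spreads and equal sizes
vUnequal = satterthwaiteDof(1.0, 20, 3.0, 18)   # unequal spreads

println("Equal spreads and sizes: v = ", vEqual)
println("Unequal spreads:         v = ", vUnequal)
```

When both samples have the same size and the same sample variance, the formula reduces to
$n_1 + n_2 - 2$, the degrees of freedom of case (II). In general the value lies between
$\min(n_1 - 1, n_2 - 1)$ and $n_1 + n_2 - 2$, and is typically not an integer.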
7.3 Analysis of Variance (ANOVA)


The methods presented in the previous section handle the problem of comparing means of two
populations. However, what if there are more than two populations that need to be compared? This
is often the case in biological, agricultural, and medical trials, among other fields, where it is of
interest to see if various 'treatments', also known as 'groups', have an effect on some mean value or
not. In these cases each type of treatment is considered as a different population.


More formally, assume that there is some overall mean $\mu$ and there are $L \geq 2$ treatments, where
each treatment may potentially alter the mean by $\tau_i$. In this case, the mean of the population
under treatment $i$ can be represented by $\mu_i = \mu + \tau_i$, with $\mu$ an overall mean and,

$$\sum_{i=1}^{L} \tau_i = 0.$$

This condition on the parameters $\tau_1, \ldots, \tau_L$ ensures that given $\mu_1, \ldots, \mu_L$, the
overall mean $\mu$ and $\tau_1, \ldots, \tau_L$ are well defined.
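To see why the constraint makes the parametrization well defined, one can solve for the overall mean
and the treatment effects explicitly. Summing the treatment means over all treatments and applying
the constraint gives:

```latex
\sum_{i=1}^{L} \mu_i = L\mu + \sum_{i=1}^{L} \tau_i = L\mu
\quad\Longrightarrow\quad
\mu = \frac{1}{L}\sum_{i=1}^{L} \mu_i,
\qquad
\tau_i = \mu_i - \mu .
```

Without the constraint, adding a constant to the overall mean and subtracting the same constant from
every treatment effect would leave all treatment means unchanged, so the parameters could not be
recovered uniquely.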


The question is then: Do the treatments have any effect or not? Such a question is presented
via the hypothesis formulation:

$$H_0 : \tau_1 = \tau_2 = \ldots = \tau_L = 0, \qquad \text{vs.} \qquad H_1 : \exists\, i \mid \tau_i \neq 0. \qquad (7.23)$$

Note that $H_0$ is equivalent to the statement that $\mu_1 = \mu_2 = \ldots = \mu_L$, indicating that the
treatments do not have an effect. Furthermore, $H_1$, stating that there exists an $i$ with
$\tau_i \neq 0$, is equivalent to the case where there exist at least two treatments, $i$ and $j$, such
that $\mu_i \neq \mu_j$. In other words this means that the choice of treatment has an effect, at least
between some treatments.


In conducting hypothesis tests such as (7.23) we collect observations (data) as follows,

Treatment 1: $x_{11}, x_{12}, \ldots, x_{1 n_1}$,
Treatment 2: $x_{21}, x_{22}, \ldots, x_{2 n_2}$,
$\vdots$
Treatment L: $x_{L1}, x_{L2}, \ldots, x_{L n_L}$,

where $n_1, n_2, \ldots, n_L$ are the sample sizes for each of the treatments (or groups). If all samples
are the same size (say $n_i = n$ for all $i$) then this is called a balanced design problem. However,
often different treatments have different sample sizes. It is also convenient to denote the total
number of observations via,

$$m = \sum_{i=1}^{L} n_i. \qquad (7.24)$$

Note that in a balanced design we have $m = Ln$.


We focus on an example for three treatments where data from machine1.csv, machine2.csv,
and machine3.csv represent sample measurements of the diameter of pipes produced by three
different machines. Here each machine is a different treatment. The diameters of the pipes vary due
to imprecision of machines but also potentially (if $H_0$ does not hold) due to variability between the
machines. A box plot of this dataset was already presented in Figure 4.9 of Chapter 4. Looking at
that plot, while there are differences between the groups, it isn't immediately clear if the machine
affects the pipe diameter or if the differences are just due to noise in manufacturing within each
machine.


Once the data is collected, in addition to displaying a box-plot as in Figure 4.9, we also consider
the values of the sample means for each individual treatment,

$$\overline{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}.$$

These values can then be compared with the overall sample mean,

$$\overline{x} = \frac{1}{m} \sum_{i=1}^{L} \sum_{j=1}^{n_i} x_{ij} = \sum_{i=1}^{L} \frac{n_i}{m}\, \overline{x}_i. \qquad (7.25)$$

In Listing 7.8 the sample means are computed for three datasets of pipe diameters. Note the
use of the broadcast operator in lines 4 and 7.


Listing 7.8: Sample means for ANOVA 


using CSV, Statistics

rfile(name) = CSV.read(name, header=false)[:,1]
data = rfile.(["../data/machine1.csv",
               "../data/machine2.csv",
               "../data/machine3.csv"])
println("Sample means for each treatment: ",round.(mean.(data),digits=2))
println("Overall sample mean: ",round(mean(vcat(data...)),digits=2))





Sample means for each treatment: [52.96, 50.95, 51.43] 
Overall sample mean: 51.82 


Even though the sample mean values are not exactly the same (they are at 52.96, 50.95 and 
51.43), without looking at variability we can't conclude if the machine type (treatment) affects the 
diameter size or not. Here is where ANOVA comes into play. 
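The second equality in (7.25) says that the overall sample mean is a weighted average of the group
means, with weights proportional to the group sizes. The following is a minimal sketch that checks
this numerically. It uses synthetic stand-in data rather than the machine datasets, and the variable
names are chosen for illustration only.

```julia
using Statistics, Random
Random.seed!(0)

# Synthetic stand-in data (not the machine datasets): three groups of unequal size.
groups = [50 .+ randn(20), 51 .+ randn(18), 52 .+ randn(25)]

n = length.(groups)          # group sizes n_i
m = sum(n)                   # total number of observations, as in (7.24)
xBars = mean.(groups)        # group sample means

overallDirect   = mean(vcat(groups...))    # first expression in (7.25)
overallWeighted = sum(n .* xBars) / m      # second expression in (7.25)
unweighted      = mean(xBars)              # plain average of the group means

println("Direct overall mean:      ", overallDirect)
println("Weighted mean of means:   ", overallWeighted)
println("Unweighted mean of means: ", unweighted)
```

The first two values agree up to floating point rounding. The unweighted average of the group means
generally differs from them unless the design is balanced.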


The typical way to establish whether or not an effect between the treatments exists is to examine
the variability of the individual treatment means, and compare these to the overall variability of
the observations. If the variability of means significantly exceeds the variability of the individual
observations, then $H_0$ is rejected, otherwise it is not. Such an approach is called ANOVA, which
stands for analysis of variance, and it is based on the decomposition of the sum of squares. In fact
ANOVA is a broad collection of statistical methods, and here we only provide an introduction to
ANOVA by covering the one-way ANOVA test. In this test, the statistical model assumes that the
observations of each treatment group come from an underlying model of the following form,

$$X_i = \mu_i + \epsilon = \mu + \tau_i + \epsilon \quad \text{where} \quad \epsilon \sim N(0, \sigma^2), \qquad (7.26)$$

where $X_i$ is the model for the $i$th treatment group and $\epsilon$ is some noise term with common
unknown variance across all treatment groups, independent across measurements. In this sense, the
ANOVA model (7.26) generalizes the assumptions of the T-test applied to case II (comparison of
two population means with variance unknown and assumed equal), as presented in the previous
section.


The process of conducting a one-way ANOVA test follows the same general approach as any other
hypothesis test. First the test statistic is calculated, then the corresponding p-value is obtained,
and finally the p-value is used to make some conclusion about whether or not to reject $H_0$ at some
chosen significance level $\alpha$. The test statistic for ANOVA, known as the F-statistic, is the ratio
of the average variability between the groups to the average variability within the groups. Under
the null hypothesis, the F-statistic is distributed according to the F-distribution, covered at the end
of Section 5.2. Hence the ANOVA test is sometimes referred to as the F-test.


Before we present the details of carrying out an F-test, we present the mathematical motivation 
used to calculate the variability within the groups (or treatments) and the variability between the 
groups. 


Decomposing Sum of Squares 


A key idea of ANOVA is the decomposition of the total variability into two components: the 
variability between the treatments, and the variability within the treatments. There are explicit 
expressions for both, and here we show how to derive them by performing what is known as the 
decomposition of the sum of squares. 


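The total sum of squares, also known as the sum of squares total ($SS_{Total}$), is a measure of the
total variability of all observations, and is calculated as follows,

$$SS_{Total} = \sum_{i=1}^{L} \sum_{j=1}^{n_i} (x_{ij} - \overline{x})^2, \qquad (7.27)$$

where $\overline{x}$ is given by (7.25). Now through algebraic manipulation (adding and subtracting
treatment means) we can show that $SS_{Total}$ can be decomposed as follows,

$$
\begin{aligned}
\sum_{i=1}^{L} \sum_{j=1}^{n_i} (x_{ij} - \overline{x})^2
 &= \sum_{i=1}^{L} \sum_{j=1}^{n_i} \big( (x_{ij} - \overline{x}_i) + (\overline{x}_i - \overline{x}) \big)^2 \\
 &= \sum_{i=1}^{L} \sum_{j=1}^{n_i} \Big[ (x_{ij} - \overline{x}_i)^2 + 2 (x_{ij} - \overline{x}_i)(\overline{x}_i - \overline{x}) + (\overline{x}_i - \overline{x})^2 \Big] \\
 &= \underbrace{\sum_{i=1}^{L} \sum_{j=1}^{n_i} (x_{ij} - \overline{x}_i)^2}_{SS_{Error}}
  + \underbrace{\sum_{i=1}^{L} n_i (\overline{x}_i - \overline{x})^2}_{SS_{Treatment}}. \qquad (7.28)
\end{aligned}
$$

Note that on the second line, the middle term reduces to zero, since
$\sum_{j=1}^{n_i} (x_{ij} - \overline{x}_i) = 0$. Hence we have shown that the total sum of squares,
$SS_{Total}$, can be decomposed to,

$$SS_{Total} = SS_{Error} + SS_{Treatment}. \qquad (7.29)$$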
Note that the sum of squares error, $SS_{Error}$, is also known as the variability within
the groups, and that the sum of squares treatment, $SS_{Treatment}$, is also known as the variability
between the groups. The decomposition (7.29) holds under both $H_0$ and $H_1$, and hence allows us
to construct a test statistic. Intuitively, under $H_0$, both $SS_{Error}$ and $SS_{Treatment}$ should
contribute to $SS_{Total}$ in the same manner (once properly normalized). Alternatively, under $H_1$ it
is expected that $SS_{Treatment}$ would contribute more heavily to the total variability.


Before proceeding with the construction of a test statistic, we present Listing 7.9, where the
decomposition (7.29) is demonstrated for the purpose of showing how to compute its individual
components in Julia. Note that this verification of the decomposition is not something one would
normally carry out in practice, as it is already proven in (7.28).


Listing 7.9: Decomposing the sum of squares 


using Random, Statistics
Random.seed!(1)
allData = [rand(24), rand(15), rand(73)]

xBarArray = mean.(allData)
nArray = length.(allData)
xBarTotal = mean(vcat(allData...))
L = length(nArray)

ssBetween=sum([nArray[i]*(xBarArray[i] - xBarTotal)^2 for i in 1:L])
ssWithin=sum([sum([(ob - xBarArray[i])^2 for ob in allData[i]]) for i in 1:L])
ssTotal=sum([sum([(ob - xBarTotal)^2 for ob in allData[i]]) for i in 1:L])

println("Sum of squares between groups: ", ssBetween)
println("Sum of squares within groups: ", ssWithin)
println("Sum of squares total: ", ssTotal)





Sum of squares between groups: 0.2941847110381936 
Sum of squares within groups: 8.50335257006105 
Sum of squares total: 8.797537281099242 








The data is generated in line 3 in an array of arrays, allData. In line 5 the mean of each treatment
is calculated via the mean() function with the broadcast operator '.', which performs the operation
over each of the three elements of allData. In line 6 we retrieve the length of each array via the
length() function and the broadcast operator. In line 7 the point estimate for the total population
mean is calculated and stored as xBarTotal. Note that in contrast to line 5, here we first vertically
concatenate all the groups into a single array. This is done via the vcat() function, and the splat
operator '...'. In line 8, the number of treatments is stored as L. In line 10, $SS_{Treatment}$ is
calculated. A comprehension is used: the point estimate of the population mean, xBarTotal, is
subtracted from the mean of the ith group, and the result squared. Each squared difference is
multiplied by the length of the respective array, and the results for all arrays are summed and stored
as ssBetween. Note that 'between' is sometimes used as an alternative name to 'treatments'. In
line 11, $SS_{Error}$ is calculated. The inner comprehension is used to square the difference between
each observation, ob, and the group mean xBarArray[i]. The outer comprehension repeats this
process for each of the groups 1 to L, and the results for all groups are summed. In line 12,
$SS_{Total}$ is calculated via (7.27). The difference between each observation, ob, and the point
estimate for the population mean, xBarTotal, is calculated and each result squared. This is first
performed for the ith array, in the inner comprehension, and then repeated for all arrays via the outer
comprehension. Finally all the squares are summed, via the outer sum() function.
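The same three quantities can also be written more compactly with generators and broadcasting, and
the identity (7.29) can be asserted numerically. The following is a minimal sketch on the same kind of
synthetic data; it is an alternative formulation for illustration, not a replacement for the listing above.

```julia
using Random, Statistics
Random.seed!(1)

# Synthetic groups of unequal size.
allData = [rand(24), rand(15), rand(73)]

xBarTotal = mean(vcat(allData...))

# Between-group term of (7.28): n_i * (group mean - overall mean)^2
ssBetween = sum(length(g) * (mean(g) - xBarTotal)^2 for g in allData)

# Within-group term of (7.28): squared deviations from each group's own mean
ssWithin = sum(sum((g .- mean(g)).^2) for g in allData)

# Total sum of squares (7.27)
ssTotal = sum((vcat(allData...) .- xBarTotal).^2)

println("ssBetween + ssWithin = ", ssBetween + ssWithin)
println("ssTotal              = ", ssTotal)
println("Equal up to rounding: ", isapprox(ssBetween + ssWithin, ssTotal))
```

The comparison uses isapprox() rather than ==, since the two sides are computed along different
floating point paths and may differ in the last few digits.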
Carrying out ANOVA


Having understood the sum of squares decomposition we now present the F-statistic of ANOVA:

$$F = \frac{SS_{Treatment} / (L - 1)}{SS_{Error} / (m - L)}. \qquad (7.30)$$

It is a ratio of the two sum of squares components of (7.29), normalized by their respective degrees
of freedom, $L - 1$ and $m - L$. These normalized quantities are respectively denoted by
$MS_{Treatment}$ and $MS_{Error}$, with 'MS' standing for 'Mean Squared'. Hence
$F = MS_{Treatment} / MS_{Error}$.


Under $H_0$ and with the model assumptions presented in (7.26), the ratio $F$ follows an
F-distribution (first introduced in Section 5.2) with $L - 1$ degrees of freedom for the numerator
and $m - L$ degrees of freedom for the denominator. Intuitively, under $H_0$ we expect the numerator
and denominator to have similar values, and hence expect $F$ to be around 1 (indeed most of the
mass of F-distributions is concentrated around 1). However, if $MS_{Treatment}$ is significantly larger,
then it indicates that $H_0$ may not hold. Hence the approach of the F-test is to reject $H_0$ if the
F-statistic is greater than the $1 - \alpha$ quantile of the respective F-distribution. Similarly, the
p-value for an observed F-statistic $f_o$ is given by,

$$p = P(F_{L-1,\, m-L} > f_o),$$

where $F_{L-1,\, m-L}$ is an F-distributed random variable with $L - 1$ numerator degrees of freedom
and $m - L$ denominator degrees of freedom.
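The rejection rule and the p-value above can be sketched in a few lines using the Distributions
package. The values of L, m, fObs and alpha below are hypothetical and chosen only for illustration;
they are not taken from the machine datasets.

```julia
using Distributions

# Hypothetical values, for illustration only: L treatments, m observations in
# total, and an observed F-statistic fObs.
L, m = 3, 60
fObs = 4.2
alpha = 0.05

fDist = FDist(L - 1, m - L)             # distribution of F under H0
critVal = quantile(fDist, 1 - alpha)    # rejection threshold
pVal = ccdf(fDist, fObs)                # P(F > fObs) under H0

println("Critical value: ", critVal)
println("p-value: ", pVal)
println(fObs > critVal ? "Reject H0" : "Do not reject H0")
```

Comparing the F-statistic with the critical value and comparing the p-value with alpha are two views
of the same decision: the statistic exceeds the $1 - \alpha$ quantile exactly when the p-value is below
$\alpha$.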


It is often customary to summarize both the intermediate and final results of an ANOVA F-test in
an ANOVA table as shown in Table 7.1 below, where 'T' and 'E' are shorthand for 'Treatments' and
'Error' respectively. Such tables also generalize to more complex ANOVA procedures not covered
here.


Source of variance:              | DOF:    | Sum of sq's:  | Mean sum of sq's:            | F-value:
Treatments (between treatments)  | $L - 1$ | $SS_T$        | $MS_T = \frac{SS_T}{L - 1}$  | $\frac{MS_T}{MS_E}$
Error (within treatments)        | $m - L$ | $SS_E$        | $MS_E = \frac{SS_E}{m - L}$  |
Total                            | $m - 1$ | $SS_{Total}$  |                              |

Table 7.1: A one-way ANOVA table.


We now return to the three machines example and carry out a one-way ANOVA F-test. This
is carried out in Listing 7.10 where we implement two alternative functions for ANOVA. The first
function, manualANOVA(), extends the sum of squares code presented in Listing 7.9 above. The
second function, glmANOVA(), utilizes the GLM package that is described in detail in Chapter 8.
Note that GLM requires the DataFrames package. Both implementations yield matching results,
returning a tuple of the F-statistic and the associated p-value. In this example, the p-value is very
small and hence under any reasonable $\alpha$ we would reject $H_0$ and conclude that there is
sufficient evidence that the diameter of the pipe depends on the type of machine used. Related is
Listing 1.18 in Chapter 1 where we carry out ANOVA for the same data using the R language.
Listing 7.10: Executing one-way ANOVA

using GLM, Distributions, DataFrames

data1 = parse.(Float64, readlines("../data/machine1.csv"))
data2 = parse.(Float64, readlines("../data/machine2.csv"))
data3 = parse.(Float64, readlines("../data/machine3.csv"))

function manualANOVA(allData)
    nArray = length.(allData)
    d = length(nArray)

    xBarTotal = mean(vcat(allData...))
    xBarArray = mean.(allData)

    ssBetween = sum( [nArray[i]*(xBarArray[i] - xBarTotal)^2 for i in 1:d] )
    ssWithin = sum([sum([(ob - xBarArray[i])^2 for ob in allData[i]])
                                for i in 1:d])
    dfBetween = d-1
    dfError = sum(nArray)-d

    msBetween = ssBetween/dfBetween
    msError = ssWithin/dfError
    fStat = msBetween/msError
    pval = ccdf(FDist(dfBetween,dfError),fStat)
    return (fStat,pval)
end

function glmANOVA(allData)
    nArray = length.(allData)
    d = length(nArray)

    treatment = vcat([fill(k,nArray[k]) for k in 1:d]...)
    response = vcat(allData...)
    dataFrame = DataFrame(Response=response, Treatment=categorical(treatment))
    modelH0 = lm(@formula(Response ~ 1), dataFrame)
    modelH1a = lm(@formula(Response ~ 1 + Treatment), dataFrame)
    res = ftest(modelH1a.model, modelH0.model)
    (res.fstat[1],res.pval[1])
end

println("Manual ANOVA: ", manualANOVA([data1, data2, data3]))
println("GLM ANOVA: ", glmANOVA([data1, data2, data3]))





Manual ANOVA: (10.516968568709117, 0.00014236168817139249) 
GLM ANOVA: (10.516968568708988, 0.0001) 
Figure 7.4: Histograms of the F-statistic for the case of equal group means
(with analytic F-distribution), and not all equal group means.
[Plot legend: Equal group means case; Unequal group means case; F-distribution analytic;
Critical value boundary. Horizontal axis: F-value.]





In lines 7-25 the function manualANOVA() is implemented, which calculates the sums of squares in
the same manner as in Listing 7.9. The sums of squares are normalized by their corresponding degrees
of freedom dfBetween and dfError, and then in line 22 the F-statistic fStat is calculated. The
p-value is then calculated in line 23 via the ccdf() function and the F-distribution FDist() with
the degrees of freedom calculated above. The function returns a tuple of values, comprising the
F-statistic and the corresponding p-value. In lines 27-38 the function glmANOVA() is defined. This
function calculates the F-statistic and p-value via functionality of the GLM package, which is
discussed at length in Chapter 8. In lines 31-33 a DataFrame (see Chapter 4) is set up in the manner
required by the GLM package. Then in lines 34-35 two 'model objects' are created via the lm()
function from the GLM package. Note that modelH0 is constructed on the assumption that the machine
type has no effect on the response, while modelH1a is constructed on the assumption that treatment
has an effect. In line 36, the ftest() function from the GLM package is used to compare if modelH1a
fits the data 'better' than modelH0. Also note that the model fields of the model objects are used.
Finally, the F-statistic and p-value are returned in line 37. The results of both functions are printed
in lines 40-41 and it can be observed that the F-statistics and p-values calculated agree to within the
numerical and rounding error expected due to the different implementations.








More on the Distribution of the F-Statistic 


Having explored the basics of ANOVA, we now use Monte Carlo simulation to illustrate that
under $H_0$ the F-statistic is indeed distributed according to the F-distribution. In Listing 7.11 below,
we present an example where Monte Carlo simulation is used to empirically generate the distribution
of the F-statistic for two different cases where the number of groups is $L = 5$. In the first case, the
means of each group are all the same and are at 13.4, but in the second case, the means are not all
the same. The first case represents $H_0$, while the latter is one possibility within $H_1$. For both
cases the standard deviation of each group is identical (2).


In this example, for each of the two cases, $N$ sample runs are generated, where each run consists
of a separate random collection of sample observations for each group. Hence by using a large number
of sample runs $N$, histograms can be used to empirically represent the theoretical distributions of
the F-statistics for both cases. The results presented in Figure 7.4 show that the distribution
of the F-statistics for the equal group means case is in agreement with the analytically expected
F-distribution, while the F-statistic for the case of unequal group means is not. The figure also
illustrates the critical value for rejection with $\alpha = 0.05$. Since $H_0$ is rejected when the
F-statistic exceeds this boundary, the area under the red curve to the right of that boundary is the
power of the test under the specific point in $H_1$ that is simulated.


Listing 7.11: Monte Carlo based distributions of the ANOVA F-statistic

using Distributions, Plots; pyplot()

function anovaFStat(allData)
    xBarArray = mean.(allData)
    nArray = length.(allData)
    xBarTotal = mean(vcat(allData...))
    L = length(nArray)

    ssBetween = sum( [nArray[i]*(xBarArray[i] - xBarTotal)^2 for i in 1:L] )
    ssWithin = sum([sum([(ob - xBarArray[i])^2 for ob in allData[i]])
                                for i in 1:L])
    return (ssBetween/(L-1))/(ssWithin/(sum(nArray)-L))
end

case1 = [13.4, 13.4, 13.4, 13.4, 13.4]
case2 = [12.7, 11.4, 13.4, 11.7, 12.9]   # partly illegible in the source; values assumed
stdDevs = [2, 2, 2, 2, 2]
numObs = [24, 15, 13, 23, 9]             # partly illegible in the source; values assumed
L = length(case1)

N = 10^4                                 # illegible in the source; value assumed

mcFstatsH0 = Array{Float64}(undef, N)
for i in 1:N
    mcFstatsH0[i] = anovaFStat([ rand(Normal(case1[j],stdDevs[j]),numObs[j])
                                for j in 1:L ])
end

mcFstatsH1 = Array{Float64}(undef, N)
for i in 1:N
    mcFstatsH1[i] = anovaFStat([ rand(Normal(case2[j],stdDevs[j]),numObs[j])
                                for j in 1:L ])
end

stephist(mcFstatsH0, bins=100,
    c=:blue, normed=true, label="Equal group means case")
stephist!(mcFstatsH1, bins=100,
    c=:red, normed=true, label="Unequal group means case")

dfBetween = L - 1
dfError = sum(numObs) - L
xGrid = 0:0.01:10
plot!(xGrid, pdf.(FDist(dfBetween, dfError),xGrid),
    c=:black, label="F-statistic analytic")
critVal = quantile(FDist(dfBetween, dfError),0.95)
plot!([critVal, critVal],[0,0.8],
    c=:black, ls=:dash, label="Critical value boundary",
    xlims=(0,10), ylims=(0,0.8), xlabel="F-value", ylabel="Density")
In lines 3-13 we create the function anovaFStat(), which takes an array of arrays as input, calculates
the sums of squares and mean sums of squares as per Table 7.1, and returns the F-statistic of the
data. It is similar to Listing 7.9. In lines 15 and 16 we create two arrays, where case1 represents an
array of means for the case of all means being equal, and case2 represents an array of means for
the case of not all means being equal. In line 17 we create the array of group standard deviations,
stdDevs. Note that in both cases of equal group means and unequal group means, the standard
deviations of all the groups are equal as per the model assumption. In line 18 we create the array
numObs, where each element represents the number of observations of the $i$th group, or level. In
line 21 we specify the total number of Monte Carlo runs to be performed, N. In line 23 we preallocate
the array mcFstatsH0, which will store N Monte Carlo generated F-statistics, for the case of all group
means equal. In lines 24-27, we use a loop to generate N F-statistics via the anovaFStat() function
defined earlier. We use the rand() and Normal() functions within a comprehension to generate data
for each of the sample groups, using the group means, standard deviations, and number of
observations. The comprehension generates an array of arrays, where each of the five elements of the
outermost array is another array containing the observations for that group, 1 to 5. This array of
arrays is then used as the argument for anovaFStat(), which computes the one-way ANOVA
F-statistic of the data. Lines 29-33 are similar, using the case2 means. In lines 35-38
histograms of the F-statistics are generated. In lines 40-41 the degrees of freedom of the treatments,
dfBetween, and the degrees of freedom of the error, dfError, are calculated. In lines 43-44 the
analytic PDF of the F-distribution is plotted. In line 45 the quantile() function is used to calculate
the 0.95 quantile of the F-distribution, FDist(), and this is then used in lines 46-48 to plot the
critical value.
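The same Monte Carlo idea can be used to estimate the power of the F-test numerically, as the fraction
of simulated F-statistics that exceed the critical value. The following is a minimal sketch under
assumed settings: the function names fStatistic and simulate, and all numerical values (means, sigma,
numObs, N), are hypothetical and differ from those of the listing above.

```julia
using Distributions, Random, Statistics
Random.seed!(1)

# One-way ANOVA F-statistic for an array of arrays, following (7.30).
function fStatistic(allData)
    n = length.(allData)
    L, m = length(n), sum(n)
    xBars, xBarTotal = mean.(allData), mean(vcat(allData...))
    ssT = sum(n .* (xBars .- xBarTotal).^2)
    ssE = sum(sum((g .- mean(g)).^2) for g in allData)
    return (ssT/(L-1)) / (ssE/(m-L))
end

# Hypothetical scenario (all values assumed for illustration).
means  = [10.0, 10.0, 12.0]
sigma  = 2.0
numObs = [15, 15, 15]
N      = 10^4
alpha  = 0.05

L, m = length(means), sum(numObs)
critVal = quantile(FDist(L-1, m-L), 1 - alpha)

# N simulated F-statistics for a given vector of group means mu.
simulate(mu) = [fStatistic([rand(Normal(mu[j], sigma), numObs[j]) for j in 1:L])
                for _ in 1:N]

power   = mean(simulate(means) .> critVal)           # rejection rate under H1
typeOne = mean(simulate(fill(10.0, L)) .> critVal)   # rejection rate under H0

println("Estimated power: ", power)
println("Estimated type I error: ", typeOne)
```

Under $H_0$ the estimated rejection rate should be close to the chosen $\alpha$, while under the
simulated alternative it estimates the power, that is, the probability mass of the F-statistic to the
right of the critical value.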








Extensions 


We have only touched on the very basics of ANOVA via the one-way ANOVA case. This stands 
at the basis of experimental design. However, there are many more aspects to ANOVA, and related 
ideas that one can explore. These include, but are not limited to: 


• Extensions to two-way ANOVA where there are two treatment categories, for example 'machine
type' and 'type of lubricant used in the machine', each having multiple treatments.

• Higher dimensional extensions, which are often considered in block factorial design.

• Comparison of individual factors to determine which specific treatments have an effect and in
which way.

• Using ANOVA for longitudinal data analysis using repeated measures.

• Aspects of optimal experimental design.


These and many more aspects can be found in design and analysis of experiment texts such 
as [M17]. At the time of writing, many such procedures are not implemented directly in Julia. 
However, one alternative is the R software package, which contains many different implementations 
of these ANOVA extensions, among others. One can call these R packages directly from Julia as in 

7.4 Independence and Goodness of Fit


We now consider a different group of hypothesis tests and associated procedures that deal with
checking for independence and, more generally, checking goodness of fit. One question often posed is:
Does the population follow a specific distributional form? We may hypothesize that the distribution
is normal, exponential, Poisson, or that it follows any other form (see Chapter 3 for an extensive
survey of probability distributions). Checking such a hypothesis is loosely called goodness of fit.
Furthermore, in the case of observations over multiple dimensions, we may hypothesize that the
different dimensions are independent. Checking for such independence is similar to the goodness of
fit check.


In order to test for goodness of fit against some hypothesized distribution $F_0$, we set up the
hypothesis test as,

$$H_0: X \sim F_0, \quad \text{vs.} \quad H_1: \text{otherwise}. \qquad (7.31)$$

Here $X$ denotes an arbitrary random variable from the population. In this case, we consider the
parameter space associated with the test as the space of all probability distributions. The hypothesis
formulation then partitions this space into $\{F_0\}$ (for $H_0$) and all other distributions (for $H_1$).

For the independence case, assume for simplicity that $X$ is a vector of two random variables,
say $X = (X_1, X_2)$. Then for this case the hypothesis test setup would be,

$$H_0: X_1 \text{ independent of } X_2, \quad \text{vs.} \quad H_1: X_1 \text{ not independent of } X_2. \qquad (7.32)$$

This sets the space of $H_0$ as the space of all distributions of independent random variable pairs, and
$H_1$ as the complement.


To handle hypotheses such as (7.31) and (7.32) we introduce two different test procedures, the
Chi-squared test and the Kolmogorov-Smirnov test. The Chi-squared test is used for goodness of fit
of discrete distributions and for checking independence, while the Kolmogorov-Smirnov test is used
for goodness of fit for arbitrary distributions based on the empirical cumulative distribution function.
Before we dive into the individual test examples, we explain how to construct the corresponding
test statistics.


In the Chi-squared case, the approach involves looking at counts of observations that match
disjoint categories $i = 1,\ldots,M$. For each category $i$, we denote by $O_i$ the number of observations
that match that category. In addition, for each category there is also an expected number of
observations under $H_0$, which we denote by $E_i$. With these, one can express the test statistic as,

$$\chi^2 = \sum_{i=1}^{M} \frac{(O_i - E_i)^2}{E_i}. \qquad (7.33)$$

Notice that under $H_0$ of (7.31) we expect that for each category $i$, $O_i$ and $E_i$ will be relatively
close, and hence the sum of relative squared differences, $\chi^2$, will not be too big.
Conversely, a large value of $\chi^2$ may indicate that $H_0$ is not plausible. Later in this section, we
show how to use $\chi^2$ to construct tests for both goodness of fit (7.31) and independence (7.32).


In the case of Kolmogorov-Smirnov, a key aspect is the empirical cumulative distribution function
(ECDF), which was introduced in Section 4.3. Recall that for a sample of observations, $x_1,\ldots,x_n$,
the ECDF is,

$$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \le x\}, \quad \text{where } \mathbf{1}\{\cdot\} \text{ is the indicator function}. \qquad (7.34)$$

The approach of the Kolmogorov-Smirnov test is to check the closeness of the ECDF to the CDF
hypothesized under $H_0$ in (7.31). This is done via the Kolmogorov-Smirnov statistic,

$$S = \sup_{x} |\hat{F}(x) - F_0(x)|, \qquad (7.35)$$

where $F_0(\cdot)$ is the CDF under $H_0$ and $\sup$ is the supremum over all possible $x$ values. Similar to
the case of Chi-squared, under $H_0$ it is expected that $\hat{F}(\cdot)$ does not deviate greatly from $F_0(\cdot)$, and
hence it is expected that $S$ is not very large.
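
The definitions (7.34) and (7.35) can be sketched directly in code. The following is a minimal illustration, not one of the book's listings: the function names `ecdfAt` and `ksStatistic` are hypothetical, the Uniform(0,1) CDF is chosen only for convenience, and the computation assumes a continuous $F_0$, in which case the supremum is attained at (or just before) one of the sorted observations.

```julia
# ECDF evaluated at a point x, directly from the definition (7.34)
ecdfAt(data, x) = count(xi -> xi <= x, data) / length(data)

# K-S statistic (7.35) for a continuous F0: the ECDF is a step function, so it
# suffices to compare with F0 at each sorted observation and just before it.
function ksStatistic(data, F0)
    x = sort(data)
    n = length(x)
    upper = maximum(i/n - F0(x[i]) for i in 1:n)       # ECDF above F0, at the jump
    lower = maximum(F0(x[i]) - (i-1)/n for i in 1:n)   # ECDF below F0, just before the jump
    max(upper, lower)
end

F0(x) = clamp(x, 0, 1)        # CDF of Uniform(0,1), for illustration only
data = [0.1, 0.4, 0.7]        # hypothetical observations

println("ECDF at 0.5:   ", ecdfAt(data, 0.5))
println("K-S statistic: ", ksStatistic(data, F0))
```

Evaluating at the observations avoids the discretization error of a grid search over $x$, at the cost of requiring the data to be sorted.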


The key to both the Chi-squared and Kolmogorov-Smirnov tests is that under $H_0$ there are
tractable known approximations to the distributions of the test statistics (7.33) and (7.35).
These approximations allow us to obtain an approximate p-value in the standard way via,

$$p = P(W > u), \qquad (7.36)$$

where $W$ denotes a random variable distributed according to the approximate distribution and $u$ is
the observed test statistic of either (7.33) or (7.35). We now elaborate on the details.


Chi-squared Test for Goodness of Fit 


Consider the hypothesis (7.31) and assume that the distribution $F_0$ can be partitioned into
categories $i = 1,\ldots,M$. Such a partition naturally occurs when the distribution is discrete with a
finite number of outcomes. It can also be artificially introduced in other cases. With such a partition,
having $n$ sample observations, we denote by $E_i$ the expected number of observations satisfying
category $i$. These values are theoretically computed. Then, based on observations $x_1,\ldots,x_n$, we
denote by $O_i$ the number of observations that satisfy category $i$. Note that,

$$\sum_{i=1}^{M} E_i = n \quad \text{and} \quad \sum_{i=1}^{M} O_i = n.$$

Now, based on $\{E_i\}$ and $\{O_i\}$, we can compute the $\chi^2$ test statistic (7.33).

It turns out that under $H_0$, the $\chi^2$ test statistic of (7.33) approximately follows a Chi-squared
distribution with $M - 1$ degrees of freedom. This allows us to approximate the p-value via
(7.36), where $W$ is taken as such a Chi-squared random variable and $u$ as the test statistic. This is
also sometimes called Pearson's chi-squared test.


We now present an example where we assume under $H_0$ that a die is biased, with the probabilities
for each side (1 to 6) given by the following vector $p$,

$$p = (0.08,\ 0.12,\ 0.2,\ 0.2,\ 0.15,\ 0.25). \qquad (7.37)$$

Note that if there are $n$ observations, we have that $E_i = n\,p_i$. For this example $n = 60$, and
hence the vector of expected values for each side is,

$$E = (4.8,\ 7.2,\ 12,\ 12,\ 9,\ 15).$$

Now imagine that the die is rolled $n = 60$ times, and the following count of outcomes (1 to 6) is
observed,

$$O = (3,\ 2,\ 9,\ 11,\ 8,\ 27).$$

In Listing 7.12 below, we use this data to compute the test statistic and p-value. This is done first
manually, and then the ChisqTest() function from the HypothesisTests package is used.
From the output we see that the p-value is around 0.0105. Hence at the $\alpha = 0.05$ level we would
reject $H_0$ and conclude that the distribution does not follow (7.37). That is, we would conclude there is
sufficient evidence to believe the die is weighted differently to the weights $p$. However, at $\alpha = 0.01$
we would fail to reject $H_0$.
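
As a check by hand, the statistic (7.33) for this example can be written out term by term. The observed counts used here are those implied by the point estimates in the listing output (each estimate multiplied by $n = 60$), and the individual terms are rounded to three decimals.

```latex
\chi^2
= \frac{(3-4.8)^2}{4.8} + \frac{(2-7.2)^2}{7.2} + \frac{(9-12)^2}{12}
+ \frac{(11-12)^2}{12} + \frac{(8-9)^2}{9} + \frac{(27-15)^2}{15}
\approx 0.675 + 3.756 + 0.750 + 0.083 + 0.111 + 9.600 = 14.975 .
```

Most of the statistic comes from the last category (side 6), where 27 outcomes were observed against 15 expected.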


Listing 7.12: Chi-squared test for goodness of fit

using Distributions, HypothesisTests

p = [0.08, 0.12, 0.2, 0.2, 0.15, 0.25]
O = [3, 2, 9, 11, 8, 27]
M = length(O)
n = sum(O)
E = n*p

testStatistic = sum((O-E).^2 ./E)
pVal = ccdf(Chisq(M-1), testStatistic)

println("Manually calculated test statistic: ", testStatistic)
println("Manually calculated p-value: ", pVal,"\n")

println(ChisqTest(O,p))





Manually calculated test statistic: 14.974999999999998
Manually calculated p-value: 0.010469694843220351

Pearson's Chi-square Test

Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.08, 0.12, 0.2, 0.2, 0.15, 0.25]
    point estimate:          [0.05, 0.0333333, 0.15, 0.183333, 0.133333, 0.45]
    95% confidence interval: Tuple{Float64,Float64}[(0.0, 0.1828), (0.0, 0.1662),
        (0.0333, 0.2828), (0.0667, 0.3162), (0.0167, 0.2662), (0.3333, 0.5828)]

Test summary:
    outcome with 95% confidence: reject h_0
    one-sided p-value:           0.0105

Details:
    Sample size:        60
    statistic:          14.975000000000001
    degrees of freedom: 5
    residuals:          [-0.821584, -1.93793, -0.866025, -0.288675, -0.333333, 3.09839]
    std. residuals:     [-0.85656, -2.06584, -0.968246, -0.322749, -0.361551, 3.57771]
In line 3 the array p is created, which represents the probabilities of each side occurring under $H_0$. In
line 4 the array O is created, which contains the frequencies, or counts, of each side outcome observed.
In line 5 the total number of categories (or side outcomes) is stored as M. In line 6 the total number
of observations is stored as n. In line 7, the array of expected numbers of outcomes for each
side is calculated by multiplying the vector of probabilities under $H_0$ by the total number
of observations n. The resulting array is stored as E. In line 9, (7.33) is used to calculate the Chi-squared
test statistic. In line 10 the test statistic is used to calculate the p-value. Since under the null
hypothesis the test statistic is asymptotically distributed according to a Chi-squared distribution, the
ccdf() function is used on a Chisq() distribution with M-1 degrees of freedom. In lines 12 and 13
the manually calculated test statistic and p-value are printed. In line 15 the ChisqTest() function
from the HypothesisTests package is used to perform the Chi-squared test on the frequency data
in array O, against the hypothesized probabilities in array p.
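
The claim that the statistic is approximately Chi-squared distributed with $M - 1$ degrees of freedom under $H_0$ can be explored by simulation. The sketch below is an illustration under stated assumptions rather than one of the book's listings: the number of repetitions, the seed, and the helper name `chiSqStat` are arbitrary choices. It repeatedly rolls a die with the probabilities (7.37), computes (7.33) each time, and estimates how often the statistic exceeds the 0.95 quantile of the Chi-squared distribution with 5 degrees of freedom.

```julia
using Random, Distributions
Random.seed!(0)   # arbitrary seed, for reproducibility only

p = [0.08, 0.12, 0.2, 0.2, 0.15, 0.25]
n, N = 60, 10^4   # rolls per experiment, number of simulated experiments
M = length(p)
E = n*p

# One simulated experiment under H0: roll the die n times and return the statistic (7.33)
function chiSqStat()
    rolls = rand(Categorical(p), n)
    O = [count(==(i), rolls) for i in 1:M]
    sum((O - E).^2 ./ E)
end

stats = [chiSqStat() for _ in 1:N]
crit = quantile(Chisq(M-1), 0.95)
rejectRate = sum(stats .> crit)/N

println("Critical value: ", crit)
println("Proportion of simulated statistics above it: ", rejectRate)
```

If the approximation is reasonable, the printed proportion should be close to 0.05, though with expected counts as small as 4.8 some discrepancy is to be expected.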








Chi-squared Test Used to Check Independence 


We now show how a Chi-squared statistic can be used to check for independence, as in (7.32).
Consider an example where 373 individuals are categorized as Male/Female, and Smoker/Non-smoker,
as in the following contingency table.

            Smoker    Non-smoker
Male          18         132
Female        45         178























In this example, 18 individuals were recorded as 'male' and 'smoker', and so forth. Now under $H_0$,
we assume that the smoking or non-smoking behavior of the individual is independent of the gender
(male or female). To check for this using a Chi-squared statistic, we first set up $\{E_{ij}\}$ and $\{O_{ij}\}$ as
in the following table.





























                      Smoker            Non-smoker         Total / proportion
Male                  O_11 = 18         O_12 = 132         150 / 0.402
                      E_11 = 25.34      E_12 = 124.67
Female                O_21 = 45         O_22 = 178         223 / 0.598
                      E_21 = 37.66      E_22 = 185.33
Total / proportion    63 / 0.169        310 / 0.831        373 / 1

Table 7.2: The elements $\{O_{ij}\}$ and $\{E_{ij}\}$ as in the contingency table.


Here the observed marginal distribution over male vs. female is based on the proportions $p =
(0.402, 0.598)$ and the distribution over smoking vs. non-smoking is based on the proportions
$q = (0.169, 0.831)$. Then, since independence is assumed under $H_0$, we multiply the marginal
probabilities to obtain the expected observation counts,

$$E_{ij} = n\, p_i\, q_j. \qquad (7.38)$$
For example, $E_{21} = 373 \times 0.598 \times 0.169 \approx 37.66$ (using the unrounded proportions). Now with
these values at hand, the Chi-squared test statistic can be set up as follows,

$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{\ell} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad (7.39)$$

where $m$ and $\ell$ are the respective dimensions of the contingency table ($m = \ell = 2$ in this example).


It turns out that under $H_0$ the test statistic (7.39) is approximately Chi-squared distributed
with $(m - 1) \times (\ell - 1)$ degrees of freedom. This implies 1 degree of freedom in our example. Hence
(7.36) can be used to determine an (approximate) p-value for this test, just like in the previous
example.

Listing 7.13 carries out a Chi-squared test in order to check if there is a relationship between
gender and smoking. In this example, since the p-value is 0.0387, we conclude that there is
some evidence of a relationship. That is, if $\alpha = 0.05$ we conclude that there is sufficient
evidence to reject $H_0$. However, if $\alpha = 0.01$ we conclude that there is insufficient evidence to
reject $H_0$.


Listing 7.13: Chi-squared for checking independence

using Distributions

xObs = [18 132; 45 178]
rowSums = [sum(xObs[i,:]) for i in 1:2]
colSums = [sum(xObs[:,i]) for i in 1:2]
n = sum(xObs)

rowProps = rowSums/n
colProps = colSums/n

xExpect = [colProps[c]*rowProps[r]*n for r in 1:2, c in 1:2]
testStat = sum([(xObs[r,c]-xExpect[r,c])^2 / xExpect[r,c] for r in 1:2, c in 1:2])
pVal = ccdf(Chisq(1),testStat)

println("Chi-squared value: ", testStat)
println("P-value: ", pVal)





Chi-squared value: 4.274080056208799 
P-value: 0.03869790606536347 








In line 3 the observation counts from the contingency table are stored as the 2-dimensional array, xObs.
In line 4 the observations in each row are summed via xObs[i,:] and the use of a comprehension.
In line 5 the observations in each column are summed via a similar approach to that in line 4.
In line 6 the total number of observations is stored as n. In lines 8 and 9 the row and column
proportions are calculated. In line 11 the expected numbers of observations, $\{E_{ij}\}$ (shown in Table 7.2),
are calculated. Note the use of the comprehension, which calculates (7.38) for each combination of
sex and smoker/non-smoker. In line 12 the test statistic (7.39) is calculated through the use of
a comprehension. In line 13 the test statistic is used to calculate the p-value. Since under the null
hypothesis the test statistic is asymptotically distributed according to a Chi-squared distribution, the
ccdf() function is used on a Chisq() distribution with $(m - 1) \times (\ell - 1)$ degrees of freedom, i.e. 1
in this example.
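
The same expected counts can also be obtained without comprehensions, since (7.38) reduces to $E_{ij} = (\text{row total}_i \times \text{column total}_j)/n$. The following sketch is an alternative formulation, not the book's listing. It also evaluates the well-known shortcut for 2×2 tables, $\chi^2 = n(ad - bc)^2 / (r_1 r_2 c_1 c_2)$, where $a, b, c, d$ are the four cell counts and $r_i$, $c_j$ the row and column totals; the variable names are chosen here for illustration.

```julia
xObs = [18 132; 45 178]
n = sum(xObs)

rowSums = sum(xObs, dims=2)    # 2×1 matrix: totals for Male, Female
colSums = sum(xObs, dims=1)    # 1×2 matrix: totals for Smoker, Non-smoker

# E[i,j] = n * (rowSums[i]/n) * (colSums[j]/n) = rowSums[i]*colSums[j]/n
xExpect = rowSums * colSums / n
testStat = sum((xObs - xExpect).^2 ./ xExpect)

# Shortcut formula, valid for 2×2 tables only
a, b, c, d = xObs[1,1], xObs[1,2], xObs[2,1], xObs[2,2]
shortcut = n*(a*d - b*c)^2 / (prod(rowSums)*prod(colSums))

println("Expected counts: ", xExpect)
println("Statistic via outer product: ", testStat)
println("Statistic via 2x2 shortcut:  ", shortcut)
```

Both values should agree with the Chi-squared value printed by Listing 7.13 up to floating point error.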
Kolmogorov-Smirnov Test


We now depart from the situation of a finite number of categories as in the Chi-squared test
and consider the Kolmogorov-Smirnov test, which is based on the test statistic $S$ from (7.35).
The approach is based on the fact that, under $H_0$ of (7.31), the empirical cumulative distribution
function (ECDF) $\hat{F}(\cdot)$ is close to the actual CDF $F_0(\cdot)$. To get a feel for this, notice that for every value
$x \in \mathbb{R}$, the ECDF at that value, $\hat{F}(x)$, is the proportion of observations less than or
equal to $x$. Under $H_0$, multiplying the ECDF by $n$ yields a binomial random variable with success
probability $F_0(x)$:

$$n\hat{F}(x) \sim \text{Bin}(n, F_0(x)).$$

Hence,

$$E[\hat{F}(x)] = F_0(x), \qquad \text{Var}(\hat{F}(x)) = \frac{F_0(x)(1 - F_0(x))}{n}.$$

See the binomial distribution in Chapter 3. Hence, for non-small $n$, the ECDF and CDF should
be close since the variance for every value $x$ is of the order of $1/n$ and diminishes as $n$ grows. The
formal statement of this, taking $n \to \infty$ and considering all values of $x$ simultaneously, is known
as the Glivenko-Cantelli theorem.
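
The mean and variance of the ECDF at a fixed point can be checked by a short simulation. The sketch below is an illustration only: it assumes a Uniform(0,1) population (so that $F_0(x) = x$ on $[0,1]$), and the evaluation point, sample size, number of repetitions, and helper name `ecdfAtX` are arbitrary choices.

```julia
using Random, Statistics
Random.seed!(0)   # arbitrary seed, for reproducibility only

n, N = 50, 10^5   # sample size, number of Monte Carlo repetitions
x = 0.3           # point at which the ECDF is evaluated
F0x = x           # for Uniform(0,1), F0(x) = x on [0,1]

# ECDF at the point x for one fresh sample of size n from Uniform(0,1)
ecdfAtX() = count(<=(x), rand(n))/n

vals = [ecdfAtX() for _ in 1:N]
println("Mean of ECDF at x:     ", mean(vals), "  (theory: ", F0x, ")")
println("Variance of ECDF at x: ", var(vals), "  (theory: ", F0x*(1-F0x)/n, ")")
```

Increasing n shrinks the variance in proportion to $1/n$, which is the pointwise version of the closeness that the Glivenko-Cantelli theorem makes uniform in $x$.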


For finite $n$, the ECDF will not exactly align with the CDF. However, the Kolmogorov-Smirnov
test statistic (7.35) is useful when it comes to measuring this deviation. This is due to the fact that
under $H_0$ the stochastic process in the variable $x$,

$$\sqrt{n}\,(\hat{F}(x) - F_0(x)), \qquad (7.40)$$

is approximately identical in probability law to a standard Brownian bridge, $B(\cdot)$, composed with
$F_0(x)$. That is, by denoting $\hat{F}_n(\cdot)$ as the ECDF with $n$ observations, we have that

$$\sqrt{n}\,(\hat{F}_n(x) - F_0(x)) \overset{d}{\approx} B(F_0(x)), \qquad (7.41)$$

which asymptotically converges to equality in distribution as $n \to \infty$. Note that a Brownian bridge,
$B(t)$, is a variant of Brownian motion, constrained to equal 0 both at $t = 0$ and $t = 1$. It
is a type of diffusion process. See for a good introduction to diffusion processes and stochastic 
calculus. 


Now consider the supremum as in the Kolmogorov-Smirnov test statistic $S$, as defined in (7.35).
It can be shown that, in cases where $F_0(\cdot)$ is a continuous function (distribution), as $n \to \infty$,

$$\sqrt{n}\,S \overset{d}{\to} \sup_{t \in [0,1]} |B(t)|. \qquad (7.42)$$

Importantly, notice that the right hand side does not depend on $F_0(\cdot)$, but rather is the maximal
value attained by the absolute value of the Brownian bridge process over the interval $[0,1]$. It then
turns out that (see for example for a derivation) such a random variable, denoted by K, has 
CDF, 


$$F_K(x) = P\Big(\sup_{t \in [0,1]} |B(t)| \le x\Big) = 1 - 2\sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2x^2} = \frac{\sqrt{2\pi}}{x} \sum_{k=1}^{\infty} e^{-(2k-1)^2\pi^2/(8x^2)}. \qquad (7.43)$$

Figure 7.5: PDF of the Kolmogorov distribution, alongside histograms of K-S
test statistics from normal and exponential populations. 


This is sometimes called the Kolmogorov distribution. Thus to obtain an approximate p-value for
the Kolmogorov-Smirnov test using (7.36) we calculate,

$$p = 1 - F_K(\sqrt{n}\,S). \qquad (7.44)$$
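
The two series representations of the Kolmogorov CDF in (7.43) can be compared numerically. The sketch below truncates each series to a fixed number of terms; the function names `kolmCdfA` and `kolmCdfB`, the truncation level, and the sample values of $n$ and $S$ are assumptions made here for illustration only.

```julia
# Truncations of the two series in (7.43); M is the number of terms kept
kolmCdfA(x; M=100) = 1 - 2*sum((-1)^(k-1)*exp(-2*k^2*x^2) for k in 1:M)
kolmCdfB(x; M=100) = sqrt(2pi)/x*sum(exp(-(2k-1)^2*pi^2/(8x^2)) for k in 1:M)

# The two forms should agree for any x > 0
for x in [0.5, 1.0, 1.358, 2.0]
    println("x = ", x, ":  ", kolmCdfA(x), "  ", kolmCdfB(x))
end

# Approximate p-value as in (7.44), for a hypothetical sample
n, S = 100, 0.134            # illustrative values only
println("p-value: ", 1 - kolmCdfA(sqrt(n)*S))
```

The first form converges quickly for large $x$ and the second for small $x$, so in practice only a handful of terms of the appropriate series is needed.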


Figure 7.5, generated by Listing 7.14, compares the PDF of the Kolmogorov distribution to the
empirical distribution of $S$ scaled by $\sqrt{n}$, as on the left hand side of (7.42). This is done for two
different scenarios. In the first, data is sampled from an exponential distribution, and in the second,
data is sampled from a normal distribution. As illustrated in Figure 7.5, the distributions
of the Monte Carlo generated test statistics are in close agreement with the analytic PDF, regardless
of what underlying distribution $F_0(x)$ the data comes from.
Listing 7.14: Comparisons of distributions of the K-S test statistic

using Distributions, StatsBase, HypothesisTests, Plots, Random; pyplot()
Random.seed!(0)

n = 25
N = 10^4
xGrid = -10:0.001:10
kGrid = 0:0.01:5
dist1, dist2 = Exponential(1), Normal()

function ksStat(dist)
    data = rand(dist,n)
    Fhat = ecdf(data)
    sqrt(n)*maximum(abs.(Fhat.(xGrid) - cdf.(dist,xGrid)))
end

kStats1 = [ksStat(dist1) for _ in 1:N]
kStats2 = [ksStat(dist2) for _ in 1:N]

p1 = stephist(kStats1, bins=50,
        c=:blue, label="KS stat (Exponential)", normed=true)
p1 = plot!(kGrid, pdf.(Kolmogorov(),kGrid),
        c=:red, label="Kolmogorov PDF", xlabel="K", ylabel="Density")

p2 = stephist(kStats2, bins=50,
        c=:blue, label="KS stat (Normal)", normed=true)
p2 = plot!(kGrid, pdf.(Kolmogorov(),kGrid),
        c=:red, label="Kolmogorov PDF", xlabel="K", ylabel="Density")

plot(p1, p2, xlims=(0,2.5), ylims=(0,1.8), size=(800, 400))








In lines 4-8 we specify the sample size n, number of Monte Carlo repetitions N, grids for computation
and plotting, and two distributions of the underlying population. In lines 10-14, the function
ksStat() is created, which takes a distribution as input, randomly samples n observations from
it, calculates the ECDF of the data via the ecdf() function, and finally returns the left hand side
of (7.42) by calculating the K-S test statistic via (7.35) (with the supremum approximated by a
maximum over xGrid) and multiplying this by sqrt(n). Note
that in line 12, the ecdf() function returns a callable object, which is stored as Fhat and
broadcast over xGrid in line 13. In lines 16-17, a comprehension is used along with the ksStat()
function to generate N K-S test statistics for each distribution of the population. The remainder
of the code compares histograms of kStats1 and kStats2 against PDFs of the Kolmogorov()
distribution.





Now that we have demonstrated that the distribution of the scaled Kolmogorov-Smirnov statistic
is similar to the distribution of $K$ as in (7.43), we demonstrate how the Kolmogorov-Smirnov
statistic can be used to carry out a goodness of fit test. For this example, consider that a series
of observations has been made from some unknown underlying gamma distribution with shape
parameter 2 and mean 5. The question we then wish to ask is: given the sample observations,
is the underlying distribution exponential with the same mean? The answer is no, because an
exponential distribution is a gamma distribution with a shape parameter of 1, not 2. However,
with a finite number of observations, such as $n = 100$ in our case, we can only expect to give an
approximate answer.
Figure 7.6: Left: CDFs and ECDF. Right: K-S processes scaled over [0, 1].


To help illustrate the logic of the approach, see Figure 7.6, generated by Listing 7.15. The left
plot presents the ECDF of the data plotted against the actual CDF (blue) as well as the postulated
CDF (red). The ECDF follows the actual CDF quite closely but does not follow the postulated
CDF well. Keep in mind that in a practical situation, we don't know the actual CDF. We only know
the postulated CDF. Still, under $H_0$ we expect only mild deviations which, when time-changed by the
CDF, behave approximately as a Brownian bridge (defined for $t \in [0,1]$). In contrast, if irregular
deviations appear we can conclude that $H_0$ does not hold. For this, look at the right hand plot
of Figure 7.6. The observed deviations (time stretched by the CDF) are in the red curve. Such a
trajectory of a Brownian bridge is possible but not plausible. Instead, most trajectories will behave
more like the blue curve.


Now the Kolmogorov-Smirnov statistic is useful. Under $H_0$ (and for non-small $n$) the scaled
statistic approximately follows the CDF (7.43). Hence this CDF can be used to compute the p-value using (7.44).
Listing 7.15 computes the p-value by manually calculating a truncation of the series in (7.43), by
using the Kolmogorov() distribution object, and by using ApproximateOneSampleKSTest()
from the HypothesisTests package. As can be observed from the output, the resulting p-values
are in close agreement at approximately 0.055. Observe now that if $\alpha = 0.05$ then we fail to reject $H_0$
because there isn't sufficient evidence that the distribution deviates from an exponential distribution.
With this example (using the same random number generation seed), if you were to increase the
number of observations to $n = 200$, then the p-value changes to 0.0004, meriting rejection under
any sensible significance level.
Listing 7.15: ECDF, actual and postulated CDF's, and their differences

using Random, Distributions, StatsBase, Plots, HypothesisTests, Measures; pyplot()
Random.seed!(3)

dist = Gamma(2, 2.5)
distH0 = Exponential(5)
n = 100
data = rand(dist,n)

Fhat = ecdf(data)
diffF(dist, x) = sqrt(n)*(Fhat.(x) - cdf.(dist,x))
xGrid = 0:0.001:30
ksStat = maximum(abs.(diffF(distH0, xGrid)))

M = 10^5
KScdf(x) = sqrt(2pi)/x*sum([exp(-(2k-1)^2*pi^2 ./(8x.^2)) for k in 1:M])

println("p-value calculated via series: ",
        1-KScdf(ksStat))
println("p-value calculated via Kolmogorov distribution: ",
        1-cdf(Kolmogorov(),ksStat),"\n")

println(ApproximateOneSampleKSTest(data,distH0))

p1 = plot(xGrid, Fhat.(xGrid),
        c=:black, lw=1, label="ECDF from data")
p1 = plot!(xGrid, cdf.(dist,xGrid),
        c=:blue, ls=:dot, label="CDF under \n actual distribution")
p1 = plot!(xGrid, cdf.(distH0,xGrid),
        c=:red, ls=:dot, label="CDF under \n postulated H0",
        xlims=(0,20), ylims=(0,1), xlabel = "x", ylabel = "Probability")

p2 = plot(cdf.(dist,xGrid), diffF(dist, xGrid), lw=0.5,
        c=:blue, label="KS Process under \n actual distribution")
p2 = plot!(cdf.(distH0,xGrid), diffF(distH0, xGrid), lw=0.5,
        c=:red, xlims=(0,1), label="KS Process under \n postulated H0",
        xlabel = "t", ylabel = "K-S Process")

plot(p1, p2, legend=:bottomright, size=(800, 400), margin = 5mm)





p-value calculated via series: 0.05473084786694438 
p-value calculated via Kolmogorov distribution: 0.054730847866944266 


Approximate one sample Kolmogorov-Smirnov test 





Population details: 


parameter of interest: Supremum of CDF differences 
value under h_0: 0.0 
point estimate: 0.13421930779083405 


Test summary: 
outcome with 95% confidence: fail to reject h_0 





two-sided p-value: 0.0545 
Details: 
number of observations: 100 


KS-statistic: 1.3421930779083404 
In lines 4 and 5 we set the actual underlying distribution (gamma) and the postulated distribution
(exponential), respectively. In line 6 we set the number of observations, n. The data is generated in line 7.
The ECDF is created in line 9 and the process (7.40) is defined in line 10 via our function diffF(),
allowing different postulated ($H_0$) distributions. The Kolmogorov-Smirnov statistic is calculated in
line 12. Line 15 implements our function KScdf() by truncating the series in (7.43) to M terms. We use
it, as well as the Kolmogorov() distribution object, to print out p-values in lines 17-20. Similarly,
in line 22 we use ApproximateOneSampleKSTest() from HypothesisTests. The remainder
of the code creates Figure 7.6, where a point to note is the horizontal-axis values in lines 32 and 34,
reflecting the composition in (7.41).








Testing for an Independent Sequence 


One may also carry out hypothesis tests for checking if a sequence of random variables is
i.i.d. (independent and identically distributed). To illustrate one such test, we introduce the classic
Wald-Wolfowitz runs test. Consider a sequence of data points $x_1,\ldots,x_n$ with sample mean $\bar{x}$. For
simplicity assume that no point equals the sample mean. Now transform the sequence to $y_1,\ldots,y_n$
via,

$$y_i = x_i - \bar{x}.$$


We now consider the signs of $y_i$. For example, in a dataset with 20 observations, once considering
the signs we may have a sequence such as:


+-+ +4 | b+ +4, 





indicating that the first is positive (greater than the mean), the second is negative (less than the
mean), the third is positive, the fourth is negative and this continues until the last four positive
signs. Note that we assume no exact 0 for $y_i$ and if such exist we can arbitrarily assign them to be
either positive or negative. We then create the random variable, $R$, counting the number of runs in
this sequence, where a run is a consecutive sequence of points having the same sign. In our example
the runs (visually separated by white space) are,


+ | +4 | TG 





Hence here $R = 9$. The essence of the Wald-Wolfowitz runs test is an approximation of the distribution
of $R$ under $H_0$. The null hypothesis is that the data is i.i.d. In that case, $R$ can be shown
to approximately follow a normal distribution with mean $\mu$ and variance $\sigma^2$, where,

$$\mu = \frac{2\,n_+\,n_-}{n} + 1, \qquad \sigma^2 = \frac{(\mu - 1)(\mu - 2)}{n - 1}. \qquad (7.45)$$





Here $n_+$ is the number of positive values and $n_-$ is the number of negative values. Note that
$n_+$ and $n_-$ are also random variables. Clearly $n_+ + n_- = n$, the total number of observations.
In our example, $n = 20$, $n_+ = 11$, $n_- = 9$. You can use (7.45) to compute the mean and variance:
$\mu = 10.9$ and $\sigma^2 = 4.64$. With such values at hand the test creates the p-value via,

$$2\,P\Big(Z > \frac{|R - \mu|}{\sigma}\Big),$$

where $Z$ is a standard normal random variable. In this example the p-value is 0.38 and hence we
do not reject $H_0$ under any plausible $\alpha$ and don't conclude that there is any apparent violation of
model assumptions due to this test. However, if the p-value had been significantly smaller
we may have had reason to suspect that some model assumptions are violated.

Figure 7.7: The distribution of the (approximate) p-value under $H_0$ for a
Wald-Wolfowitz runs test.
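
The numbers quoted for the runs example ($R = 9$, $\mu = 10.9$, $\sigma^2 = 4.64$ and a p-value of 0.38) can be reproduced with a few lines of code. The sign sequence below is hypothetical: it is one sequence consistent with the description in the text (20 observations, 11 positive, 9 negative, 9 runs, starting with + - + - and ending with four positive signs), and is not necessarily the sequence used in the book.

```julia
using Distributions

# A hypothetical sign sequence (true = "+"), chosen to be consistent with the
# description in the text: n = 20, 11 positives, 9 negatives, 9 runs.
signs = collect("+-+---++---+++--++++") .== '+'

n = length(signs)
nPlus = sum(signs)
nMinus = n - nPlus

# A new run starts at every sign change
R = 1 + count(signs[i] != signs[i+1] for i in 1:n-1)

mu = 2*nPlus*nMinus/n + 1                 # mean in (7.45)
sigma2 = (mu-1)*(mu-2)/(n-1)              # variance in (7.45)
pVal = 2*ccdf(Normal(), abs(R-mu)/sqrt(sigma2))

println("R = ", R, ", mu = ", mu, ", sigma^2 = ", sigma2)
println("p-value: ", pVal)
```

Since the statistic depends on the data only through $n_+$, $n_-$ and $R$, any sequence with the same three values yields the same p-value.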


In Listing 7.16 we implement a Wald-Wolfowitz runs test for synthetic data. We generate $10^6$
samples of i.i.d. standard normal sequences, each of length $n = 1{,}000$. Since this agrees with $H_0$ of
the Wald-Wolfowitz runs test, we expect the p-value to follow an approximately uniform distribution
(see Listing 7.19). To explore this we plot a histogram of the p-values and their ECDF in
Figure 7.7.

The distribution as it appears in the histogram is clearly not uniform. Spikes of high density
appear and these are due to lattice effects (we are approximating a discrete random variable, $R$, with a
normal random variable). Nevertheless, the ECDF indicates that the distribution is almost uniform.
The output of Listing 7.16 also presents the area to the left of the p-value for several specified $\alpha$ and
illustrates that there is agreement. Hence the normal approximation with parameters as in (7.45)
appears to be very close.
Listing 7.16: The Wald-Wolfowitz runs test

using CSV, GLM, StatsBase, Random, Distributions, Plots, Measures; pyplot()
Random.seed!(0)

n, N = 10^3, 10^6

function waldWolfowitz(data)
    n = length(data)
    sgns = data .>= mean(data)
    nPlus, nMinus = sum(sgns), n - sum(sgns)
    wwMu = 2*nPlus*nMinus/n + 1
    wwVar = (wwMu-1)*(wwMu-2)/(n-1)

    R = 1
    for i in 1:n-1
        R += sgns[i] != sgns[i+1]
    end

    zStat = abs((R-wwMu)/sqrt(wwVar))
    2*ccdf(Normal(),zStat)
end

experimentPvals = [waldWolfowitz(rand(Normal(),n)) for _ in 1:N]
for alpha in [0.001, 0.005, 0.01, 0.05, 0.1]
    pva = sum(experimentPvals .< alpha)/N
    println("For alpha = $(alpha), p-value area = $(pva)")
end

p1 = histogram(experimentPvals, bins = 5n, legend = false,
        xlabel = "p-value", ylabel = "Frequency")
Fhat = ecdf(experimentPvals)
pGrid = 0:0.001:1
p2 = plot(pGrid, Fhat.(pGrid), legend = false, xlabel = "p-value", ylabel = "ECDF")
plot(p1, p2, size = (1000,400), margin = 5mm)








For alpha = 0.001, p-value area = 0.000855 
For alpha = 0.005, p-value area = 0.005196 
For alpha = 0.01, p-value area = 0.010739 
For alpha = 0.05, p-value area = 0.05272 
For alpha = 0.1, p-value area = 0.094269 





In lines 6-20 we implement the waldWolfowitz() function. In line 8 we compare data to its mean
and create a sequence of signs, sgns. Using the sign() function would have been possible, however
by using >= we ensure that ties with 0 are broken. We implement (7.45) in lines 9-11 and then in lines 13-16
count the number of runs, R. We then calculate the p-value and return it in line 19. In line 22 we
generate N p-values for the synthetic data for a sample size of n. These are then analyzed in lines
23-26 where we estimate the proportion of p-values less than a specified alpha and print these out
for comparison with alpha. The remainder of the code creates Figure 7.7. Note the use of ecdf() from
StatsBase, similar to Listing 4.17 of Chapter 4.
7.5 More on Power


In this section the concept of power is covered in greater depth together with related aspects
of hypothesis testing such as the distribution of the p-value. Recall that as first introduced in
Section 5.6 and summarized in Table 5.1, the statistical power of a hypothesis test is the probability
of correctly rejecting $H_0$ and is given by $1 - P(\text{Type II error})$. We now reinforce this idea through a
concrete introductory example.


Consider a normal population with unknown parameters $\mu$ and $\sigma$, and say that we wish to
conduct a one-sided hypothesis test on the population mean using the following hypothesis test
set-up,

$$H_0: \mu = \mu_0 \quad \text{and} \quad H_1: \mu > \mu_0. \qquad (7.46)$$


Importantly, since power is the probability of a correct rejection, if the underlying (unknown)
parameter $\mu$ varies greatly from the value under the null hypothesis, $\mu_0$, then the power of the test in
this scenario is greater. Likewise, if the underlying parameter does not vary greatly from the value
under the null hypothesis, the power of the test is lower. Similar effects occur due to the underlying
(unknown) variance as well as the sample size. A lower variance implies higher power and larger
sample sizes increase power. Also, reducing (improving) $\alpha$ will decrease the power and vice versa.


In Listing 7.17 below, several different scenarios, labelled A, B, C, and D, are considered. For
each, N test statistics are calculated via Monte Carlo simulation for N sample groups and the power
of the test is estimated. First the underlying mean equals the mean under the null hypothesis and
hence the power equals $\alpha$. Then in each subsequent scenario, the parameters or sample size are
changed in a way that power is increased. First in scenario A, the underlying mean is increased,
then in scenario B it is increased further. Further in scenario C the sample size is increased, and
finally in scenario D the standard deviation is decreased. As you can observe from the output of
Listing 7.17, each of these incremental changes increases the power, up to approximately 0.91 in case
D. In practice, if keeping $\alpha$ constant, it is only the sample size that can be controlled; however,
understanding the effect of the other parameters on the power is important in deciding how large
samples should be.


For some of these scenarios Listing 7.17 employs kernel density estimation to plot the distribution
of the test statistics. The resulting Figure is similar to Figure however in this case the 
focus is on power.. The power under different scenarios is given by the area under each PDF to 
the right of the critical value boundary. Hence the more ‘separation’ that we can achieve from the 
distribution under Ho, the better. As a side point, note that the curves shown in Figure [7.8] could 
have alternatively been obtained analytically via the non-central T-distribution, a classical concept 
that we do not cover further. 
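As a rough illustration of this side point, the sketch below computes the power analytically instead of by simulation. It assumes the `NoncentralT` distribution of Distributions.jl, and the helper name `analyticPower` is hypothetical (it does not appear in the listings). The parameter values mirror $H_0$, scenario A, and scenario B, so the results should land close to the Monte Carlo estimates.

```julia
using Distributions

# Analytic power of the one-sided T-test of H0: mu = mu0 against H1: mu > mu0.
# Under a true mean mu, the test statistic follows a non-central T-distribution
# with n-1 degrees of freedom and non-centrality (mu - mu0)/(sig/sqrt(n)).
# The name analyticPower is a hypothetical helper used only for this sketch.
function analyticPower(mu0, mu, sig, n, alpha)
    tCrit = quantile(TDist(n-1), 1-alpha)
    delta = (mu - mu0)/(sig/sqrt(n))
    ccdf(NoncentralT(n-1, delta), tCrit)
end

mu0, sig, n, alpha = 20, 7, 5, 0.05
for mu in (20, 22, 24)
    println("mu = ", mu, ": power = ",
        round(analyticPower(mu0, mu, sig, n, alpha), digits = 4))
end
```

At `mu = mu0` the non-centrality is zero, so the expression reduces to the ordinary T-distribution and the power equals `alpha`.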
Listing 7.17: Distributions under different hypotheses


using Random, Distributions, KernelDensity, Plots, LaTeXStrings; pyplot()
Random.seed!(1)

function tStat(mu0,mu,sig,n)
    sample = rand(Normal(mu,sig),n)
    xBar = mean(sample)
    s = std(sample)
    (xBar-mu0)/(s/sqrt(n))
end

mu0, mu1A, mu1B = 20, 22, 24
sig, n = 7, 5
N = 10^6
alpha = 0.05

dataH0 = [tStat(mu0,mu0,sig,n) for _ in 1:N]
dataH1A = [tStat(mu0,mu1A,sig,n) for _ in 1:N]
dataH1B = [tStat(mu0,mu1B,sig,n) for _ in 1:N]
dataH1C = [tStat(mu0,mu1B,sig,2*n) for _ in 1:N]
dataH1D = [tStat(mu0,mu1B,sig/2,2*n) for _ in 1:N]

tCrit = quantile(TDist(n-1),1-alpha)   # n-1 degrees of freedom, used for all scenarios
estPwr(sample) = sum(sample .> tCrit)/N

println("Rejection boundary: ", tCrit)
println("Power under H0: ", estPwr(dataH0))
println("Power under H1A: ", estPwr(dataH1A))
println("Power under H1B (mu's farther apart): ", estPwr(dataH1B))
println("Power under H1C (double sample size): ", estPwr(dataH1C))
println("Power under H1D (like H1C but std/2): ", estPwr(dataH1D))

kH0 = kde(dataH0)
kH1A = kde(dataH1A)
kH1D = kde(dataH1D)
xGrid = -10:0.1:15

plot(xGrid, pdf(kH0,xGrid),
    c=:blue, label="Distribution under H0")
plot!(xGrid, pdf(kH1A,xGrid),
    c=:red, label="Distribution under H1A")
plot!(xGrid, pdf(kH1D,xGrid),
    c=:green, label="Distribution under H1D")
plot!([tCrit,tCrit], [0,0.4],
    c=:black, ls=:dash, label="Critical value boundary",
    xlims=(-5,10), ylims=(0,0.4), xlabel=L"\Delta = \mu - \mu_0")











Rejection boundary: 2.131846786326649 

Power under H0: 0.049598

Power under H1A: 0.134274 

Power under H1B (mu's farther apart): 0.281904 
Power under H1C (double sample size): 0.406385 
Power under H1D (like H1C but std/2): 0.91554 
Figure 7.8: Numerically estimated distributions of the test statistic for
various scenarios of values of the underlying parameter $\mu$.





In lines 4-9 the function tStat() is defined, which returns the value of the test statistic for a
randomly generated sample. Note that mu0 represents the value under the null hypothesis, used
to calculate the test statistic, while mu represents the actual underlying mean, used to
generate random samples. In line 11 we define different values of $\mu$, matching $H_0$, scenario A, and
scenario B. In line 12 the standard deviation and the number of observations are defined. In line 13
we define the number of Monte Carlo repetitions and in line 14 we define the significance level. In
lines 16-20, the tStat() function is used along with a series of comprehensions to generate N test
statistics for each of the scenarios described above. In line 22, the critical value for the significance
level alpha is calculated by using the quantile() function on a T-distribution, TDist(), with
n-1 degrees of freedom. In line 23 the function estPwr() is defined, which takes an array of test
statistics as input and approximates the corresponding power of the scenario as the proportion
of statistics that exceed the previously calculated tCrit, i.e. the proportion of cases for which the null
hypothesis was rejected. Note the use of .>, which returns an array of true and false values that
are then summed up and divided by N. Lines 25-30 estimate the power for each of the scenarios
and print the output. Then lines 32-34 create UnivariateKDE objects representing kernel density estimates
of the test statistics. These are then plotted in lines 37-42 using the pdf() function with a method
for kernel density estimates. Lines 43-45 plot the critical value.








Power Curves 


From Listing 7.17 we can see that the statistical power of a hypothesis test can vary. It depends
not only on the parameters of the test, such as the number of observations in the sample group $n$
and the specified significance level $\alpha$, but also on the underlying parameter values $\mu$ and $\sigma$. Hence,
a key aspect of experimental design involves determining the test parameters such that not only is
the probability of a type I error controlled, but the test is also sufficiently powerful over a range
of different scenarios. This is important, as in reality there are infinitely many possibilities
in $H_1$, any one of which could describe the underlying parameters. By designing a statistical test
that has sufficient power, we aim to have confidence that if the underlying parameter deviates from
Figure 7.9: Power curves for the one-sided T-test
with various sample sizes at $\alpha = 0.05$.


the null hypothesis, then this will be identified. Such planning is often aided by inspecting power 
curves. 


A power curve is a plot of the power as a function of certain parameters. To illustrate this we
continue with the hypothesis formulation (7.46). Listing 7.18 estimates the power of a one-sided
T-test. We estimate the power of this hypothesis test setup over a range of different values of $\mu$, for
sample sizes $n = 5, 10, 20, 30$. For each sample size, the power is estimated and the result is a
power curve that can be plotted as in Figure 7.9.


Our focus is on $\mu > 20$. It can be seen that as the number of sample observations increases, the
statistical power of the test increases. Similarly, the larger $\mu$ is (compared to $\mu_0 = 20$), the higher
the power. Observe also that at $\mu = \mu_0$ the power is $\alpha = 0.05$. An interesting subtle point
is that where $\mu < 20$, the ordering of the curves is reversed. For example, one can see that in
this region the scenario where $n = 30$ has less power than that where $n = 5$, because the
probability of rejecting the null hypothesis at all is lower. Another point to note is that the x-axis
could be adjusted to represent the difference between the value $\mu_0$ under the null hypothesis
and the various possible values of $\mu$, as was done in Figure 7.8. Furthermore, one could make the axis
scale invariant by dividing this difference by the standard deviation. Such curves are often seen in
experimental design reference material.


With such a plot at hand, assume we are planning a costly (in the sample size) hypothesis test
aiming to show that $\mu > 20$. Say that for sample size planning purposes, we have reason to believe
that $\mu > 24$. Assume now that after fixing $\alpha = 0.05$ we wish to have power greater than 0.6. We
then choose between the options of $n = 5$, $n = 10$, or $n = 20$ observations. Figure 7.9 illustrates
that $n = 10$ is a sufficient number of observations. However, if we had plausible reason to believe
that (under $H_1$) $\mu = 22$, then $n = 30$ observations are required to attain power greater than 0.6.
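The planning question above can also be posed directly as a search over sample sizes. The following is a minimal sketch, not the approach of the listings: it assumes the non-central T-distribution (`NoncentralT` in Distributions.jl) gives the power, and the names `tTestPower` and `smallestN` are hypothetical helpers introduced only for illustration. Since it searches all sample sizes rather than only the four plotted ones, it may return a value that lies between the curves of Figure 7.9.

```julia
using Distributions

# Analytic power of the one-sided T-test (hypothetical helper for this sketch)
tTestPower(mu0, mu, sig, n, alpha) =
    ccdf(NoncentralT(n-1, (mu - mu0)/(sig/sqrt(n))), quantile(TDist(n-1), 1-alpha))

# Smallest sample size n in 2:nMax attaining the target power;
# returns nothing if no sample size up to nMax is sufficient
function smallestN(mu0, mu, sig, alpha, target; nMax = 1000)
    for n in 2:nMax
        tTestPower(mu0, mu, sig, n, alpha) >= target && return n
    end
    return nothing
end

mu0, sig, alpha, target = 20, 5, 0.05, 0.6
println("Underlying mean 24: smallest n = ", smallestN(mu0, 24, sig, alpha, target))
println("Underlying mean 22: smallest n = ", smallestN(mu0, 22, sig, alpha, target))
```

The search starts at n = 2 because the T-distribution needs at least one degree of freedom.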
Listing 7.18: Power curves for different sample sizes


using Distributions, Plots, LaTeXStrings, Random; pyplot()

function tStat(mu0,mu,sig,n)
    sample = rand(Normal(mu,sig),n)
    xBar = mean(sample)
    s = std(sample)
    (xBar-mu0)/(s/sqrt(n))
end

function powerEstimate(mu0,mu1,sig,n,alpha,N)
    Random.seed!(0)
    sampleH1 = [tStat(mu0,mu1,sig,n) for _ in 1:N]
    critVal = quantile(TDist(n-1),1-alpha)
    sum(sampleH1 .> critVal)/N
end

mu0 = 20
sig = 5
alpha = 0.05
N = 10^4
rangeMu1 = 16:0.1:30
nList = [5,10,20,30]

powerCurves = [powerEstimate.(mu0,rangeMu1,sig,n,alpha,N) for n in nList]

plot(rangeMu1, powerCurves[1], c=:blue, label="n = $(nList[1])")
plot!(rangeMu1, powerCurves[2], c=:red, label="n = $(nList[2])")
plot!(rangeMu1, powerCurves[3], c=:green, label="n = $(nList[3])")
plot!(rangeMu1, powerCurves[4], c=:purple, label="n = $(nList[4])",
    xlabel=L"\mu", ylabel="Power",
    xlims=(minimum(rangeMu1),maximum(rangeMu1)), ylims=(0,1))























In lines 3-8 the function tStat() is defined in an identical manner to Listing 7.17. In lines 10-15,
the function powerEstimate() is created. It uses Monte Carlo to estimate the power of the
one-sided hypothesis test (7.46) given the value under the null hypothesis mu0 and the actual parameter
of the underlying process mu1. The other arguments of the function are the actual standard deviation
sig, the sample size n, the significance level alpha, and the total number of Monte Carlo
repetitions N. Since we use common random numbers (see Section 10.6), in line 11 we fix the seed. In
line 12 tStat() is used along with a comprehension to generate N test statistics from N independent
sample groups. The test statistics are stored in the array sampleH1. In line 13, the critical value
for the given scenario of inputs is calculated in the same manner as in line 22 of Listing 7.17. In line
14 the proportion of test statistics greater than the critical value is calculated using the same approach
as that of line 23 of Listing 7.17. In lines 17-22 the parameters of the problem are specified: the value
under the null hypothesis mu0, the underlying standard deviation of the unknown process sig, the
significance level alpha, and the number of Monte Carlo repetitions N. The range of values of the
underlying mean mu1 over which the power is calculated is specified as rangeMu1. The sample sizes
for which power curves are computed are specified in nList. The actual simulation is carried out in
line 24, where powerCurves is an array of arrays (one for each sample size in nList), with each entry
being the result of powerEstimate() broadcast over rangeMu1. The remaining lines plot the curves.
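To see what fixing the seed buys, the sketch below estimates a short power curve twice, once with independent random streams and once with common random numbers. The keyword argument `fixSeed` is a hypothetical switch added for this illustration only, and the small values of `N` and `muRange` are chosen to keep the run short. With common random numbers every sample is a shifted copy of the same underlying draws, so the estimated curve cannot decrease as the mean grows.

```julia
using Random, Distributions

function tStat(mu0, mu, sig, n)
    sample = rand(Normal(mu, sig), n)
    (mean(sample) - mu0)/(std(sample)/sqrt(n))
end

# Monte Carlo power estimate; fixSeed is a hypothetical switch for this sketch
function powerEstimate(mu0, mu1, sig, n, alpha, N; fixSeed = true)
    fixSeed && Random.seed!(0)    # common random numbers across values of mu1
    critVal = quantile(TDist(n-1), 1-alpha)
    mean(tStat(mu0, mu1, sig, n) > critVal for _ in 1:N)
end

mu0, sig, n, alpha, N = 20, 5, 10, 0.05, 10^3
muRange = 20:0.25:26

Random.seed!(1)
independentCurve = [powerEstimate(mu0, mu, sig, n, alpha, N, fixSeed = false)
                        for mu in muRange]
crnCurve = [powerEstimate(mu0, mu, sig, n, alpha, N) for mu in muRange]

println("Non-decreasing with independent streams:   ", issorted(independentCurve))
println("Non-decreasing with common random numbers: ", issorted(crnCurve))
```

The curve from independent streams may or may not come out monotone on a given run, while the curve with common random numbers is smooth by construction. That is why the plotted power curves do not look jagged even for a moderate number of repetitions.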
Figure 7.10: The distribution of the p-value under $H_0$ vs. the distribution of
the p-value under $H_1$ with $\mu = 22$ (A) and under $H_1$ with $\mu = 24$ (B).
The horizontal axis is the p-value and the dashed line marks $\alpha$.


Distribution of the p-value 


Throughout this chapter, equations of the form $p = \mathbb{P}(S > u)$ were presented, where $S$ is a
random variable representing the test statistic, $u$ is the observed test statistic, and $p$ is the p-value
of the observed test statistic. We now explore the distribution of the p-value.


To discuss the distribution of the p-value, denote the random variable of the p-value by $P$.
Hence $P = 1 - F(S)$, where $F(\cdot)$ is the CDF of the test statistic under $H_0$. Note that $P$ is just
a transformation of the test statistic random variable $S$. Assume that $S$ is continuous and assume
that $H_0$ holds, hence $\mathbb{P}(S < u) = F(u)$. We now have,

















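$$
\begin{aligned}
\mathbb{P}(P > x) &= \mathbb{P}\big(1 - F(S) > x\big) \\
&= \mathbb{P}\big(F(S) < 1 - x\big) \\
&= \mathbb{P}\big(S < F^{-1}(1-x)\big) \\
&= F\big(F^{-1}(1-x)\big) \\
&= 1 - x.
\end{aligned}
$$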


Recall that for a uniform(0,1) random variable, the CCDF is $1-x$ on $x \in [0,1]$. Therefore under $H_0$,
$P$ is a uniform(0,1) random variable. This agrees with the fact that under $H_0$, the chance of rejecting
$H_0$ is exactly $\alpha$ (this happens when $p < \alpha$).


If $H_0$ does not hold then $\mathbb{P}(S < u) \neq F(u)$ and the derivation above fails. In such a case the
distribution of the p-value is no longer uniform. In fact, if the setting is such that the
power of the test increases, then we expect the distribution of the p-value to be more concentrated
around 0 than a uniform distribution.
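The uniformity claim under $H_0$ can be checked numerically. The sketch below is only an illustration under assumed parameter values (mean 20, standard deviation 7, samples of size 5), and the function name `pvalH0` is hypothetical. It simulates p-values of the one-sided T-test under $H_0$ and compares the proportion of p-values not exceeding $x$ with $x$ itself.

```julia
using Random, Distributions
Random.seed!(0)

# p-value of the one-sided T-test for a sample generated under H0 (mu = mu0).
# The name pvalH0 is a hypothetical helper used only for this sketch.
function pvalH0(mu0, sig, n)
    sample = rand(Normal(mu0, sig), n)
    tStat = (mean(sample) - mu0)/(std(sample)/sqrt(n))
    ccdf(TDist(n-1), tStat)
end

mu0, sig, n, N = 20, 7, 5, 10^5
pVals = [pvalH0(mu0, sig, n) for _ in 1:N]

# For a uniform(0,1) random variable, the CDF at x equals x on [0,1]
for x in (0.01, 0.05, 0.25, 0.5, 0.9)
    println("x = ", x, ": proportion of p-values <= x is ", mean(pVals .<= x))
end
```

Each printed proportion should be close to the corresponding x, up to Monte Carlo error.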


In Listing 7.19 we revisit $H_0$, scenario A, and scenario B from Listing 7.17. For each case we
simulate Monte Carlo repetitions of the p-value, estimate the power, and plot the distribution in
Figure 7.10. Note that the power is the area to the left of $\alpha$ in the figure and, as you can see from
the output, it agrees with the output of Listing 7.17.


Listing 7.19: Distribution of the p-value


using Random, Distributions, KernelDensity, Plots, LaTeXStrings; pyplot()
Random.seed!(1)

function pval(mu0,mu,sig,n)
    sample = rand(Normal(mu,sig),n)
    xBar = mean(sample)
    s = std(sample)
    tStat = (xBar-mu0)/(s/sqrt(n))
    ccdf(TDist(n-1), tStat)
end

mu0, mu1A, mu1B = 20, 22, 24
sig, n, N = 7, 5, 10^6

pValsH0 = [pval(mu0,mu0,sig,n) for _ in 1:N]
pValsH1A = [pval(mu0,mu1A,sig,n) for _ in 1:N]
pValsH1B = [pval(mu0,mu1B,sig,n) for _ in 1:N]

alpha = 0.05
estPwr(pVals) = sum(pVals .< alpha)/N

println("Power under H0: ", estPwr(pValsH0))
println("Power under H1A: ", estPwr(pValsH1A))
println("Power under H1B (mu's farther apart): ", estPwr(pValsH1B))

stephist(pValsH0, bins=100,
    normed=true, c=:blue, label="Under H0")
stephist!(pValsH1A, bins=100,
    normed=true, c=:red, label="Under H1A")
stephist!(pValsH1B, bins=100,
    normed=true, c=:green, label="Under H1B",
    xlims=(0,1), ylims=(0,6), xlabel = "p-value", ylabel = "Density")
plot!([alpha,alpha], [0,6],
    c=:black, ls=:dash, label=L"\alpha")





Power under H0: 0.049598
Power under H1A: 0.134274 
Power under H1B (mu's farther apart): 0.281904 





In lines 4 to 10 the function pval() is defined. It is similar to the tStat() function from Listings 7.17
and 7.18 but includes the extra line 9, which calculates the p-value from the test statistic of line 8. Note
the use of the ccdf() function. After defining the parameters in a way similar to the previous listings,
we create arrays of simulated p-values in lines 15-17. These are then used to estimate the power in
each case using our function estPwr() from line 20. The remainder of the code creates Figure 7.10.





Chapter 8 


Linear Regression and Extensions - 
DRAFT 


We now explore regression analysis, one of the most popular statistical techniques used in
practice. We focus on linear regression. The key idea is to consider a dependent variable Y
and see how it is affected by one or more independent variables, typically denoted by X. That is,
regression analysis considers how X affects Y and helps build models for predicting future values of
Y given future observed or postulated values of X. To build such models, we utilize collected (x, y)
data and assume that the data at hand resemble future situations.


The variable Y is sometimes called the response variable or dependent variable and X is the 
explanatory variable or independent variable. Often there is a vector of explanatory variables, still 
denoted X. In the context of machine learning, we often call the elements of X or transformations 
of them, features. That is, X is the feature vector (or scalar) that helps us predict the value of Y. 


When considering Y and X as random variables (with X possibly vector-valued), the phrase
‘regression of Y on X’ signifies the conditional expectation of Y given an observed value of X, say
X = x. That is, one may stipulate that at the onset both X and Y are random, and that once a value
of X is observed it is no longer random and the regression is given by,


$$\hat{y} = E[Y \mid X = x]. \tag{8.1}$$














Here, $\hat{y}$ is a predictor of the dependent variable Y, given an observation of the independent
variable X. We can then say that $\hat{y}(x)$ is the predicted value and use any input argument x for the
prediction.


The simplest regression model for scalar X assumes that the regression function is affine (i.e.
linear with an intercept term). That is,

$$\hat{y} = E[Y \mid X = x] = \beta_0 + \beta_1 x.$$


Here $\beta_0$ and $\beta_1$ describe the intercept and slope respectively. In this case, a typical statistical model
is to assume X is non-random and set,














$$Y = \beta_0 + \beta_1 x + \varepsilon,$$


where $\varepsilon$ is considered as a noise term. In the basic case, $\varepsilon$ is taken as a normally distributed random
variable independent of everything else, with a variance that does not depend on X.
A widely used method for finding estimates for $\beta_0$ and $\beta_1$ is least squares. Given a series
of observation tuples, $(x_1,y_1),\ldots,(x_n,y_n)$, which can be viewed as a ‘cloud of points’, the least
squares method finds the so-called ‘line of best fit’, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, where $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimates of
$\beta_0$ and $\beta_1$, which are assumed to exist but are unknown. The estimates minimize a least squares cost,
also known as a loss function. This is described in detail in this chapter.
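As a small preview, the sketch below fits a line of best fit to synthetic data in two equivalent ways. All numerical values (the coefficients 2.0 and 1.5, the noise level 2.5, and 50 points) are assumptions chosen for illustration, and the function name `loss` is hypothetical. It uses only the Julia standard library.

```julia
using Random, Statistics
Random.seed!(0)

# Synthetic data from Y = beta0 + beta1*x + noise (all values are illustrative)
beta0, beta1, sigma, n = 2.0, 1.5, 2.5, 50
xVals = collect(range(0, 10, length = n))
yVals = beta0 .+ beta1*xVals + sigma*randn(n)

# Closed-form least squares estimates of the intercept and slope
xBar, yBar = mean(xVals), mean(yVals)
b1 = sum((xVals .- xBar).*(yVals .- yBar))/sum((xVals .- xBar).^2)
b0 = yBar - b1*xBar

# The same estimates obtained from the design matrix via the backslash operator
A = [ones(n) xVals]
b0Alt, b1Alt = A \ yVals

# Least squares cost (loss) of the line a + b*x on the data
loss(a, b) = sum((yVals .- (a .+ b*xVals)).^2)

println("Closed form: ", (b0, b1))
println("Backslash:   ", (b0Alt, b1Alt))
println("Loss at estimates: ", loss(b0, b1),
    ", loss at true parameters: ", loss(beta0, beta1))
```

The two sets of estimates agree up to numerical error, and the loss at the estimates is never larger than the loss at the true parameters, since the estimates minimize the loss on the observed data.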


The ideas of regression that we cover here span a broad spectrum of fields including engineering,
statistics, data science, machine learning, and many aspects of artificial intelligence. Hence in
different domains, there is often a tradition to focus on different aspects of regression. For example,
from an engineering or a machine learning perspective, one often focuses on the mechanics of least
squares without giving much consideration to modeling assumptions and model interpretation. In
such fields, the key is to create models based on training data with the aim of making good predictions.
As opposed to that, a classical statistical perspective applied to science (e.g. crop science) often
merits considering model assumptions in depth, so as to validate parameter interpretation and test
for cause-and-effect relationships. Also, in machine learning, data science, and artificial intelligence,
one may attempt to use advanced regression (or ‘regression like’) models to create automated systems
that predict well. Such usage of regression is covered partly towards the end of this chapter as we
introduce GLM (Generalized Linear Models). Such an approach is also covered more broadly in
the next chapter, where we consider several methods from supervised learning dealing with both
regression and classification problems in machine learning.


As there are various approaches to considering regression, we try to strike a balance between them
in the current presentation. For example, in Section 8.1 we focus on the many mechanistic aspects of
least squares without considering any model assumptions. This is the basic engineering viewpoint.
As opposed to that, Section 8.2 presents the basic linear regression model with one variable and
its full statistical analysis, including hypothesis testing, confidence intervals, and checking of model
assumptions. Then subsequent sections progress into the rich world of vector-valued X, where we
try to introduce the main concepts in a way that is useful both for classic statisticians and for
machine learning professionals. In Section 8.3 we depart from simple linear regression and extend to
vector-valued X. In Section 8.4 we show how extra features can be incorporated into the model when
dealing with non-linear aspects, categorical variables, and interaction effects. Then in Section 8.5
we touch on the complicated area of model selection. In Section 8.6 we discuss logistic regression and
other generalized linear models without considering many statistical aspects, but rather taking a
simple hands-on machine learning perspective. We then close with a brief taste of how to apply
regression to time series in Section 8.7 and also describe a few other aspects of time-series analysis.


From a software perspective, the key tool used in this chapter is the Julia GLM.jl package,
standing for ‘Generalized Linear Models’. It is used for ‘linear models’ but also allows for GLM.
The reader should keep in mind that other Julia packages may also be suitable for regression,
including Flux.jl, MultivariateStats.jl, and Lasso.jl. The former two are used in the
next chapter, while Lasso.jl is used briefly in this chapter in the context of model selection.





Since regression analysis is a broad subject that sits at the heart of statistics, machine learning, 
data science, and artificial intelligence, one can spend much time exploring details. This chapter only 
presents key concepts and further exploration may be carried out by considering several additional 
textbooks. For an elementary introduction in the context of scientific applications we recommend 
B10]. Then a much more comprehensive (yet still classic) treatment is in [M17]. More modern 
approaches are in [EH16] as well as [HTFO1]. Finally we also recommend Chapter 5 of [KBTV19]. 
8.1 Clouds of Points and Least Squares


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 8.1: Polynomial interpolation vs. a line 


using Plots; pyplot()

xVals = [-2,3,5,6,12,14]
yVals = [7,2,9,3,12,3]
n = length(xVals)

# Vandermonde matrix of the interpolating polynomial and its coefficients
V = [xVals[i+1]^(j) for i in 0:n-1, j in 0:n-1]
c = V \ yVals
xGrid = -5:0.01:20
f1(x) = c'*[x^i for i in 0:n-1]

beta0, beta1 = 4.58, 0.17
f2(x) = beta0 + beta1*x

plot(xGrid, f1.(xGrid), c=:blue, label="Polynomial 5th order")
plot!(xGrid, f2.(xGrid), c=:red, label="Linear model")
scatter!(xVals, yVals,
    c=:black, shape=:xcross, ms=8,
    label="Data points", xlims=(-5,20), ylims=(-50,50),
    xlabel = "x", ylabel = "y")
Listing 8.2: L1 and L2 norm minimization by Monte Carlo guessing


using DataFrames, Distributions, Random, LinearAlgebra, CSV, Plots; pyplot()
Random.seed!(0)

data = CSV.read("../data/L1L2data.csv")
xVals, yVals, n = data.X, data.Y, size(data)[1]

N = 10^6
alphaMin, alphaMax, betaMin, betaMax = 0, 5, 0, 5

alpha1, beta1, alpha2, beta2, bestL1Cost, bestL2Cost = 0.0,0.0,0.0,0.0,Inf,Inf
for _ in 1:N
    rAlpha, rBeta = rand(Uniform(alphaMin,alphaMax)), rand(Uniform(betaMin,betaMax))
    L1Cost = norm(rAlpha .+ rBeta*xVals - yVals,1)
    if L1Cost < bestL1Cost
        global alpha1 = rAlpha
        global beta1 = rBeta
        global bestL1Cost = L1Cost
    end
    L2Cost = norm(rAlpha .+ rBeta*xVals - yVals)
    if L2Cost < bestL2Cost
        global alpha2 = rAlpha
        global beta2 = rBeta
        global bestL2Cost = L2Cost
    end
end

println("L1 line: $(round(alpha1,digits = 2)) + $(round(beta1,digits = 2))x")
println("L2 line: $(round(alpha2,digits = 2)) + $(round(beta2,digits = 2))x")

# Residuals of the L2 line and a helper drawing the squared residual as a square
d = yVals - (alpha2 .+ beta2*xVals)
rectangle(x, y, d) = Shape(x .- [0,d,d,0,0], y .- [0,0,d,d,0])

p1 = scatter(xVals, yVals, c=:black, ms=5, label="")
p1 = plot!([0,10], [alpha1, alpha1 .+ beta1*10], c=:blue, label="L1 minimized")
for i in 1:n
    x,y = xVals[i],yVals[i]
    p1 = plot!([x, x], [y, alpha1 .+ beta1*x], color="black", label="")
end

p2 = scatter(xVals, yVals, c=:black, ms=5, label="")
p2 = plot!([0,10], [alpha2, alpha2 .+ beta2*10], c=:red, label="L2 minimized")
for i in 1:n
    x,y = xVals[i],yVals[i]
    p2 = plot!(rectangle(x,y,d[i]), fc=:gray, fa=0.5, label="")
end

p3 = scatter(xVals, yVals, c=:black, ms=5, label="")
p3 = plot!([0,10], [alpha1, alpha1 .+ beta1*10], c=:blue, label="L1 minimized")
p3 = plot!([0,10], [alpha2, alpha2 .+ beta2*10], c=:red, label="L2 minimized")

plot(p1, p2, p3, layout = (1,3),
    ratio=:equal, xlims=(0,10), ylims=(0,10),
    legend=:topleft, size=(1200, 400),
    xlabel = "x", ylabel = "y")
Listing 8.3: Computing least squares estimates


using DataFrames, GLM, Statistics, LinearAlgebra, CSV
data = CSV.read("../data/L1L2data.csv")
xVals, yVals = data[:,1], data[:,2]
n = length(xVals)
A = [ones(n) xVals]

# Approach A
xBar, yBar = mean(xVals),mean(yVals)
sXX, sXY = ones(n)'*(xVals.-xBar).^2 , dot(xVals.-xBar,yVals.-yBar)
b1A = sXY/sXX
b0A = yBar - b1A*xBar

# Approach B
b1B = cor(xVals,yVals)*(std(yVals)/std(xVals))
b0B = yBar - b1B*xBar

# Approach C
b0C, b1C = A'A \ A'yVals

# Approach D
Adag = inv(A'*A)*A'
b0D, b1D = Adag*yVals

# Approach E
b0E, b1E = pinv(A)*yVals

# Approach F
b0F, b1F = A\yVals

# Approach G
F = qr(A)
Q, R = F.Q, F.R
b0G, b1G = (inv(R)*Q')*yVals

# Approach H
F = svd(A)
V, Sp, Us = F.V, Diagonal(1 ./ F.S), F.U'
b0H, b1H = (V*Sp*Us)*yVals

# Approach I
eta, eps = 0.002, 10^-6.
b, bPrev = [0,0], [1,1]
while norm(bPrev-b) >= eps
    global bPrev = b
    global b = b - eta*2*A'*(A*b - yVals)
end
b0I, b1I = b[1], b[2]

# Approach J
modelJ = lm(@formula(Y ~ X), data)
b0J, b1J = coef(modelJ)

# Approach K
modelK = glm(@formula(Y ~ X), data, Normal())
b0K, b1K = coef(modelK)

println(round.([b0A,b0B,b0C,b0D,b0E,b0F,b0G,b0H,b0I,b0J,b0K],digits=3))
println(round.([b1A,b1B,b1C,b1D,b1E,b1F,b1G,b1H,b1I,b1J,b1K],digits=3))
8.2 Linear Regression with One Variable


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 8.4: Simple linear regression with GLM


using DataFrames, GLM, Statistics, CSV, Plots; pyplot()
data = CSV.read("../data/weightHeight.csv")

lm1 = lm(@formula(Height ~ Weight), data)
lm2 = fit(LinearModel, @formula(Height ~ Weight), data)

glm1 = glm(@formula(Height ~ Weight), data, Normal(), IdentityLink())
glm2 = fit(GeneralizedLinearModel, @formula(Height ~ Weight), data, Normal(),
    IdentityLink())

println("***Output of LM Model:")
println(lm1)
println("\n***Output of GLM Model:")
println(glm1)

pred(x) = coef(lm1)'*[1, x]

println("\n***Individual methods applied to model output:")
println("Deviance: ",deviance(lm1))
println("Standard error: ",stderror(lm1))
println("Degrees of freedom: ",dof_residual(lm1))
println("Covariance matrix: ",vcov(lm1))

yVals = data.Height
SStotal = sum((yVals .- mean(yVals)).^2)

println("R squared (calculated in two ways): ",r2(lm1),
    ",\t", 1 - deviance(lm1)/SStotal)

println("MSE (calculated in two ways): ",deviance(lm1)/dof_residual(lm1),
    ",\t",sum((pred.(data.Weight) - data.Height).^2)/(size(data)[1] - 2))

xlims = [minimum(data.Weight), maximum(data.Weight)]
scatter(data.Weight, data.Height, c=:blue, msw=0)
plot!(xlims, pred.(xlims),
    c=:red, xlims=(xlims),
    xlabel="Weight (kg)", ylabel="Height (cm)", legend=:none)
Listing 8.5: The distribution of the regression estimators


using DataFrames, GLM, Distributions, LinearAlgebra, Random
using Plots, LaTeXStrings; pyplot()
Random.seed!(0)

beta0, beta1 = 2.0, 1.5
sigma = 2.5
n, N = 10, 10^4
alpha = 0.05

function coefEst()
    xVals = collect(1:n)
    yVals = beta0 .+ beta1*xVals + rand(Normal(0,sigma),n)
    data = DataFrame([xVals,yVals], [:X,:Y])
    model = lm(@formula(Y ~ X), data)
    coef(model)
end

ests = [coefEst() for _ in 1:N]

# Theoretical variances and covariance of the two estimators
xBar = mean(1:n)
sXX = sum([(x - xBar)^2 for x in 1:n])
sx2 = sum([x^2 for x in 1:n])
var0 = sigma^2 * sx2/(n*sXX)
var1 = sigma^2/sXX
cv = -sigma^2*xBar/sXX

mu = [beta0, beta1]
Sigma = [var0 cv; cv var1]

A = cholesky(Sigma).L
Ai = inv(A)

r = quantile(Rayleigh(),1-alpha)
isInEllipse(x) = norm(Ai*(x-mu)) <= r
estIn = isInEllipse.(ests)

println("Proportion of points inside ellipse: ", sum(estIn)/N)

scatter(first.(ests[estIn]),last.(ests[estIn]),c=:green,ms=2,msw=0)
scatter!(first.(ests[.!estIn]),last.(ests[.!estIn]),c=:blue,ms=2,msw=0)

ellipsePts = [r*A*[cos(t),sin(t)] + mu for t in 0:0.01:2pi]
scatter!([beta0],[beta1],c=:red,ms=5,msw=0)
plot!(first.(ellipsePts),last.(ellipsePts),
    c=:red, lw=2, legend=:none,
    xlabel=L"\hat{\beta}_0", ylabel=L"\hat{\beta}_1")
Listing 8.6: Statistical inference for simple linear regression


using CSV, GLM, Distributions, Plots; pyplot()

data = CSV.read("../data/weightHeight.csv")
df = sort(data, :Weight)[1:20,:]
model = lm(@formula(Height ~ Weight), df)
pred(x) = coef(model)'*[1, x]

# Two-sided p-value for the slope, computed manually
tStat = coef(model)[2]/stderror(model)[2]
n = size(df)[1]
pVal = 2*ccdf(TDist(n-2),abs(tStat))
println("Manual Pval: ", pVal)

println("Manual Confidence Interval: ",
    (coef(model)[2] - quantile(TDist(n-2),0.975)*stderror(model)[2],
     coef(model)[2] + quantile(TDist(n-2),0.975)*stderror(model)[2]))
println(model)

scatter(df.Weight, df.Height, c=:blue, msw=0)
xlims = [minimum(df.Weight), maximum(df.Weight)]
plot!(xlims, pred.(xlims),
    c=:red, legend=:none, xlabel = "Weight (kg)", ylabel = "Height (cm)")





Listing 8.7: Confidence and prediction bands


using CSV, GLM, Distributions, Plots; pyplot()
data = CSV.read("../data/weightHeight.csv")
n = size(data)[1]
model = fit(LinearModel, @formula(Height ~ Weight), data)

alpha = 0.1
tVal = quantile(TDist(n-2),1-alpha/2)
xbar = mean(data.Weight)
# Sum of squared deviations of the weights: sample variance times (n-1)
sXX = var(data.Weight)*(n-1)
MSE = deviance(model)/(n-2)
pred(x) = coef(model)'*[1, x]

# sign is + or - for the upper or lower band; prediction = 1 gives a prediction band
interval(x,sign,prediction = 0) = sign(pred(x),
    tVal*sqrt(MSE*(prediction + 1/n + (x-xbar)^2/sXX)))

xGrid = 40:1:140
scatter(data.Weight, data.Height, c=:black, ms=2, label="")
plot!(xGrid, pred.(xGrid), c=:red, label="Linear model")
plot!(xGrid, interval.(xGrid,+), c=:green, label="Confidence interval")
plot!(xGrid, interval.(xGrid,-), c=:green, label="")
plot!(xGrid, interval.(xGrid,+,1), c=:blue, label="Prediction interval")
plot!(xGrid, interval.(xGrid,-,1),
    c=:blue, label="", xlims=(40, 120), ylims=(145, 200), legend=:topleft,
    xlabel = "Weight (kg)", ylabel = "Height (cm)")
Listing 8.8: The Anscombe quartet datasets


using RDatasets, DataFrames, GLM, Plots, Measures; pyplot()
df = dataset("datasets", "anscombe")

model1 = lm(@formula(Y1 ~ X1), df)
model2 = lm(@formula(Y2 ~ X2), df)
model3 = lm(@formula(Y3 ~ X3), df)
model4 = lm(@formula(Y4 ~ X4), df)

println("Model 1. Coefficients: ", coef(model1), "\t R squared: ", r2(model1))
println("Model 2. Coefficients: ", coef(model2), "\t R squared: ", r2(model2))
println("Model 3. Coefficients: ", coef(model3), "\t R squared: ", r2(model3))
println("Model 4. Coefficients: ", coef(model4), "\t R squared: ", r2(model4))

yHat(model, X) = coef(model)' * [1, X]
xlims = [0, 20]

p1 = scatter(df.X1, df.Y1, c=:blue, msw=0, ms=8)
p1 = plot!(xlims, [yHat(model1, x) for x in xlims], c=:red, xlims=(xlims))

p2 = scatter(df.X2, df.Y2, c=:blue, msw=0, ms=8)
p2 = plot!(xlims, [yHat(model2, x) for x in xlims], c=:red, xlims=(xlims))

p3 = scatter(df.X3, df.Y3, c=:blue, msw=0, ms=8)
p3 = plot!(xlims, [yHat(model3, x) for x in xlims], c=:red, xlims=(xlims))

p4 = scatter(df.X4, df.Y4, c=:blue, msw=0, ms=8)
p4 = plot!(xlims, [yHat(model4, x) for x in xlims], c=:red, msw=0, xlims=(xlims))

plot(p1, p2, p3, p4, layout = (2,2), xlims = (0,20), ylims = (0,14),
    legend=:none, xlabel = "x", ylabel = "y",
    size=(1200, 800), margin = 10mm)




















using CSV, GLM, LinearAlgebra, StatsPlots, Plots, Measures; pyplot()

df = CSV.read("../data/weightHeight.csv")
n = size(df)[1]

model = lm(@formula(Height ~ Weight), df)
MSE = deviance(model)/dof_residual(model)
pred(x) = coef(model)'*[1, x]

# Hat matrix, residuals, and studentized residuals
A = [ones(n) df.Weight]
H = A*pinv(A)
residuals = (I-H)*df.Height
studentizedResiduals = residuals ./ (sqrt.(MSE*(1 .- diag(H))))

tau = rank(H)
println("tau = ", tau, " or ", sum(diag(H)), " or ", tr(H))

cookDistances = (studentizedResiduals.^2/tau) .* diag(H) ./ (1 .- diag(H))
maxCook, indexMaxCook = findmax(cookDistances)
println("Maximal Cook's distance = ", maxCook)

p1 = plot([minimum(df.Weight),maximum(df.Weight)], [0,0],
    lw = 2, c=:black, label=:none)
scatter!([df.Weight[indexMaxCook]], [studentizedResiduals[indexMaxCook]],
    c=:red, ms = 10, msw = 0, shape =:cross,
    label = "Point with maximal Cook's distance", legend=:topleft)
scatter!(df.Weight, studentizedResiduals, xlabel = "Weight (kg)",
    ylabel = "Studentized Residual (cm)", c=:blue, msw=0, label=:none)

p2 = qqnorm(studentizedResiduals, msw=0, lw=2, c =[:red :blue], legend = false,
    xlabel="Normal Theoretical Quantiles",
    ylabel="Studentized Residual Quantiles")

plot(p1, p2, size=(1000,500), margin = 5mm)





8.3 Multiple Linear Regression 


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 8.10: Multiple linear regression


using RDatasets, GLM, Statistics 


df = dataset ("MASS", "cpus") 
ché. ies = maja =>10*9/5% y che .cCywet) 


model = Im(fformula (Perf ~ MMax + Cach + ChMax + Freq), df) 
pred(x) = round (coef (model) ' *vcat (1,x) digits = 3) 

prime la (Wim = Y, size (el) 111) 

println("(Avg, Std) of observed performance: ", (mean(df.Perf),std(df.Perf))) 
joxe Liege Ja (0 
println 


Estimated performance for computer A: ", pred([32000, 32, 32, 4x10^7])) 
Estimated performance for computer B: ", pred([32000, 16, 32, 6x10^7])) 


( 

( 
println (model) 

( 

( 
Listing 8.11: Exploring collinearity


using Distributions, GLM, DataFrames, Random, LinearAlgebra
Random.seed!(0)

n = 100
beta0, beta1, beta2, beta3 = 10, 30, 60, 90
sig, sigX = 25, 5
etaVals = [200.0, 100.0, 50.0, 10.0, 1.0, 0.1, 0.01, 0.0]





function createDataFrame(eta)
    Random.seed!(1)
    x1 = round.(collect(1:n) + sigX*rand(Normal(),n), digits = 5)
    x2 = round.(collect(1:n) + sigX*rand(Normal(),n), digits = 5)
    x3 = round.(x1 + 2*x2 + eta*rand(Normal(),n), digits = 5)
    y = beta0 .+ beta1*x1 + beta2*x2 + beta3*x3 + sig*rand(Normal(),n)
    return DataFrame(Y = y, X1 = x1, X2 = x2, X3 = x3)
end


vis = 1/22 (a (toalla (X MES XT RE 202) recla) 


for eta in etaVals
    print("eta = $(eta): ")
    df = createDataFrame(eta)
    glmOK = true
    try
        global model = lm(@formula(Y ~ X1 + X2 + X3), df)
    catch err
        println("Exception with GLM: ", err)
        glmOK = false
    end

    if glmOK
        covMat = vcov(model)
        sigVec = sqrt.(diag(covMat))
        corrmat = round.(covMat ./ (sigVec*sigVec'), digits=5)
acia Mor (xb xs = Wo cora 12,41, 
WW Coie (022,23) = V eorne la, 41, 
U Nue Wiss OU VERSO Cas 2) 0) 
    else
        println("\t In this case we may use SVD or ridge regression if needed.")
    end
end
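
The remark about SVD and ridge regression in the listing above can be illustrated directly. The following is a minimal sketch, assuming a hypothetical design matrix (not the data generated by createDataFrame) in which the third predictor is exactly x1 + 2 x2. The value of lambda and the tolerance 1e-10 are arbitrary choices made for illustration. The sketch computes the minimum-norm least squares solution via the SVD-based pseudo-inverse and a ridge solution via the regularized normal equations.

```julia
using LinearAlgebra

# Hypothetical design (an assumption): the third predictor is exactly x1 + 2*x2
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x3 = x1 + 2 * x2
A = [ones(6) x1 x2 x3]
y = 10 .+ 30 * x1 + 60 * x2 + 90 * x3      # noise-free response, for clarity

# The design has 4 columns but only rank 3, so A'A is singular
r = rank(A, rtol = 1e-10)
println("rank(A) = ", r, " out of ", size(A, 2), " columns")

# Minimum-norm least squares solution via the SVD-based pseudo-inverse
betaSVD = pinv(A, rtol = 1e-10) * y

# Ridge regression: the penalty lambda makes A'A + lambda*I invertible
lambda = 1e-3
betaRidge = (A' * A + lambda * I) \ (A' * y)

println("SVD fit reproduces y: ", isapprox(A * betaSVD, y; rtol = 1e-6))
println("Ridge relative fit error: ", norm(A * betaRidge - y) / norm(y))
```

Neither solution recovers the coefficients (10, 30, 60, 90) individually, since with exact collinearity they are not identifiable. Both do, however, yield usable fitted values.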
8.4 Model Adaptations


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 8.12: Linear regression of a polynomial model


using CSV, GLM, Plots; pyplot()

data = CSV.read("../data/polynomialData.csv")
data.X2 = abs2.(data.X)

model1 = lm(@formula(Y ~ X + X2), data)
model2 = lm(@formula(Y ~ X*X), data)
println(model1)
println(model2)

yHat(x) = coef(model1)[1] + coef(model1)[2]*x + coef(model1)[3]*x^2


Kecel = 550.185 
scatter(data.X, data.Y, c=:blue, msw=0)
plot!(xGrid, yHat.(xGrid),
    xlabel="X", ylabel="Y", c=:red, legend=:none)





using CSV, GLM, Plots, Random; pyplot()
Random.seed!(0)

df = CSV.read("../data/weightHeight.csv", copycols = true)
n = size(df)[1]
df = df[shuffle(1:n),:]
df[[10,40,60,130,140,175,190,200],:Sex] .= "O1"
df[[9,44,63,132,133,172,192,199],:Sex] .= "O2"

model = lm(@formula(Height ~ Weight + Sex), df,
    contrasts=Dict(:Sex=>DummyCoding(base="F", levels=["M","O1","O2","F"])))
b0, b1, b2, b3, b4 = coef(model)
pred(weight,sex) = b0+b1*weight+b2*(sex=="M")+b3*(sex=="O1")+b4*(sex=="O2")
println(model)

males = df[df.Sex .== "M",:]
females = df[df.Sex .== "F",:]
other = df[(df.Sex .!= "M") .& (df.Sex .!= "F"),:]

xlim = [minimum(df.Weight), maximum(df.Weight)]
scatter(males.Weight, males.Height, c=:blue, msw=0, label="Males")
plot!(xlim, pred.(xlim,"M"), c=:blue, label="Male model")

scatter!(females.Weight, females.Height, c=:red, msw=0, label="Females")
plot!(xlim, pred.(xlim,"F"),
    c=:red, label="Female model", xlims=(xlim),
    xlabel="Weight (kg)", ylabel="Height (cm)", legend=:topleft)

scatter!(other.Weight, other.Height, c=:green, msw=0, label="Other")
Listing 8.14: Regression with categorical variables - with interaction effects


using CSV, GLM, Plots, Random; pyplot()
Random.seed!(0)

df = CSV.read("../data/weightHeight.csv", copycols = true)
n = size(df)[1]
df = df[shuffle(1:n),:]
df[[10,40,60,130,140,175,190,200],:Sex] .= "O"

conts = Dict(:Sex=>DummyCoding(base="F",levels=["F","M","O"]))
model1 = lm(@formula(Height ~ Weight * Sex), df, contrasts=conts)
model2 = lm(@formula(Height ~ Weight & Sex), df, contrasts=conts)
model3 = lm(@formula(Height ~ Weight + Weight & Sex), df, contrasts=conts)

a0, a1, a2, a3, a4, a5 = coef(model1)
b0, b1, b2, b3 = coef(model2)
c0, c1, c2, c3 = coef(model3)

println(model1)
println(model2)
println(model3)
println("Model2 and Model3 equivalence: ",
    round.([b0 - c0, b1 - c1, b2 - (c1+c2), b3 - (c1+c3)], digits=5))








pred1(weight,sex) = a0 + a1*weight +
    (sex=="M")*(a2+a4*weight) +
    (sex=="O")*(a3+a5*weight)
pred2(weight,sex) = b0 + weight*(b1*(sex=="F") + b2*(sex=="M") + b3*(sex=="O"))

males, females, other = df[df.Sex .=="M",:], df[df.Sex .=="F",:], df[df.Sex .=="O",:]

xlim = [0, maximum(df.Weight)]
scatter(males.Weight, males.Height, c=:blue, msw=0, label="Male")
plot!(xlim, pred1.(xlim,"M"), c=:blue, lw = 1.5, label="Male fit 1")
plot!(xlim, pred2.(xlim,"M"), c=:blue, linestyle=:dash, label="Male fit 2")
scatter!(females.Weight, females.Height, c=:red, msw=0, label="Female")
plot!(xlim, pred1.(xlim,"F"), c=:red, lw=1.5, label="Female fit 1", xlims=(xlim))
plot!(xlim, pred2.(xlim,"F"), c=:red, linestyle=:dash, label="Female fit 2")
scatter!(other.Weight, other.Height, c=:green, msw=0, label="Other",
    xlabel="Weight (kg)", ylabel="Height (cm)", legend=:topleft)




















using CSV, GLM, Plots; pyplot()

df = CSV.read("../data/IQalc.csv")
groupA = df[df.Group .== "A", :]
groupB = df[df.Group .== "B", :]
groupC = df[df.Group .== "C", :]

# One model for all the data and one model per group
model = fit(LinearModel, @formula(AlcConsumption ~ IQ), df)
modelA = fit(LinearModel, @formula(AlcConsumption ~ IQ), groupA)
modelB = fit(LinearModel, @formula(AlcConsumption ~ IQ), groupB)
modelC = fit(LinearModel, @formula(AlcConsumption ~ IQ), groupC)

pred(x) = coef(model)' * [1, x]
predA(x) = coef(modelA)' * [1, x]
predB(x) = coef(modelB)' * [1, x]
predC(x) = coef(modelC)' * [1, x]

xlims = collect(extrema(df.IQ))

p1 = scatter(df.IQ, df.AlcConsumption, c=:black, msw=0, ma=0.2, label="")
plot!(xlims, pred.(xlims), c=:black, label="All data")

p2 = scatter(groupA.IQ, groupA.AlcConsumption, c=:blue, msw=0, ma=0.2, label="")
scatter!(groupB.IQ, groupB.AlcConsumption, c=:red, msw=0, ma=0.2, label="")
scatter!(groupC.IQ, groupC.AlcConsumption, c=:green, msw=0, ma=0.2, label="")
plot!(xlims, predA.(xlims), c=:blue, label="Group A")
plot!(xlims, predB.(xlims), c=:red, label="Group B")
plot!(xlims, predC.(xlims), c=:green, label="Group C")

plot(p1, p2, xlims=(xlims), ylims=(0,1),
    xlabel="IQ", ylabel="Alcohol Metric", size=(800,400))





8.5 Model Selection 


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 8.16: Basic model selection


using StatsModels, RDatasets, DataFrames, GLM, Random
Random.seed!(0)

n = 30
df = dataset("MASS", "cpus")[1:n,:]
df.Freq = map(x -> 10^9/x, df.CycT)
df = df[:, [:Perf, :MMax, :Cach, :ChMax, :Freq]]
df.Junk1 = rand(n)
df.Junk2 = rand(n)

# Backward elimination: repeatedly drop the predictor with the largest p-value
function stepReg(df, reVar, pThresh)
    predVars = setdiff(propertynames(df), [reVar])
    numVars = length(predVars)
    model = nothing
    while numVars > 0
        fm = term(reVar) ~ term.((1, predVars...))
        model = lm(fm, df)
        pVals = coeftable(model).cols[4][2:end]
        println("Variables: ", predVars)
        println("P-values = ", round.(pVals,digits = 3))
        pVal, knockout = findmax(pVals)
        pVal < pThresh && break
        println("\tRemoving the variable ", predVars[knockout],
                    " with p-value = ", round(pVal,digits=3))
        deleteat!(predVars,knockout)
        numVars = length(predVars)
    end
    model
end

model = stepReg(df, :Perf, 0.05)
println(model)
Listing 8.17: Using LASSO for model selection


using RDatasets, DataFrames, Lasso, LaTeXStrings, Plots, Measures; pyplot()

df = dataset("MASS", "cpus")
df.Freq = map(x -> 10^9/x, df.CycT)
df = df[:, [:Perf, :Freq, :MMin, :MMax, :Cach, :ChMin, :ChMax]]
X = [df.Freq df.MMin df.MMax df.Cach df.ChMin df.ChMax]
Y = df.Perf

targetNumVars = 3

lambdaStep = 0.2
lamGrid = collect(0:lambdaStep:150)
lassoFit = fit(LassoPath, X, Y, λ = lamGrid)
dd = Array(lassoFit.coefs)'
nV = sum(dd .!= 0.0, dims=2)

goodLambda = lamGrid[findfirst((n)->n==targetNumVars, nV)]

newFit = fit(LassoPath, X, Y, λ = [goodLambda - lambdaStep, goodLambda])
println(newFit)
println("Coefficients: ", Array(newFit.coefs)'[2,:])

p1 = plot(lassoFit.λ, dd, label = ["Freq" "MMin" "MMax" "Cach" "ChMin" "ChMax"],
    ylabel = "Coefficient Value")
plot!([goodLambda,goodLambda], [-1,1.5], c=:black, lw=2, label = "Model cut-off")

p2 = plot(lassoFit.λ, nV, ylabel = "Number of Variables", legend = false)
plot!([goodLambda,goodLambda], [0,6], c=:black, lw=2, label = "Model cut-off")

plot(p1, p2, xlabel = L"\lambda", margin = 5mm, size = (800,400))





8.6 Logistic Regression and the Generalized Linear Model 


WITH THE EXCEPTION OF CODE 

THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 8.18: Logistic regression


using GLM, DataFrames, Distributions, Plots, CSV; pyplot()
data = CSV.read("../data/examData.csv")

model = glm(@formula(Pass ~ Hours), data, Binomial(), LogitLink())
println(model)

# Fitted probability of passing via the logistic function
b0, b1 = coef(model)
pred(x) = 1/(1+exp(-(b0 + b1*x)))

xGrid = 0:0.1:maximum(data.Hours)
scatter(data.Hours, data.Pass, c=:blue, msw=0)
plot!(xGrid, pred.(xGrid),
    c=:red, xlabel="Hours studied", legend=:none,
    xlims=(0, maximum(data.Hours)), yticks=([0:1;], ["Fail", "Pass"]))
Listing 8.19: Exploring generalized linear models


using GLM, RDatasets, DataFrames, Distributions, Random, LinearAlgebra
Random.seed!(0)

df = dataset("MASS", "cpus")
n = size(df)[1]
df = df[shuffle(1:n),:]

pTest = 0.2
lastTindex = Int(floor(n*(1-pTest)))
numTest = n - lastTindex

train = df[1:lastTindex,:]
test = df[lastTindex+1:n,:]

form = @formula(Perf~CycT+MMin+MMax+Cach+ChMin+ChMax)
model1 = glm(form, train, Normal(), IdentityLink())
model2 = glm(form, train, Poisson(), LogLink())
model3 = glm(form, train, Gamma(), InverseLink())

# Inverses of the three link functions
invIdenityLink(x) = x
invLogLink(x) = exp(x)
invInverseLink(x) = 1/x

A = [ones(numTest) test.CycT test.MMin test.MMax test.Cach test.ChMin test.ChMax]
pred1 = invIdenityLink.(A*coef(model1))
pred2 = invLogLink.(A*coef(model2))
pred3 = invInverseLink.(A*coef(model3))

actual = test.Perf
lossModel1 = norm(pred1 - actual)
lossModel2 = norm(pred2 - actual)
lossModel3 = norm(pred3 - actual)

println("Model 1: ", coef(model1))
println("Model 2: ", coef(model2))
println("Model 3: ", coef(model3))
println("\nLoss of models 1,2,3: ", (lossModel1, lossModel2, lossModel3))







8.7 A Taste of Time Series and Forecasting 


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 8.20: Exploratory data analysis of a time series


using CSV, TimeSeries, Dates, Statistics, Measures, Plots, StatsPlots; pyplot()

df = CSV.read("../data/oneOnEpsilonBlogs.csv",copycols = true)
tsA = TimeArray(Date.(df.Day,Dates.DateFormat("m/d/y")),df.Users)
tsB = moving(mean,tsA,7,padding = true)
tsC = TimeArray(timestamp(tsA), values(tsA) - values(tsB))

dow = dayofweek.(timestamp(tsA));
dayDiv = [filter((x)->!isnan(x) && x >= -50 && x <= 50, values(tsC)[dow .== d])
            for d in 1:7]

start2020 = findfirst((d)->Year(d)==Year(20), timestamp(tsA))
indexLast = length(tsA)
indexes2020 = start2020:indexLast

dayGroup1 = [7]         #Sun
dayGroup2 = [1,6]       #Mon, Sat
dayGroup3 = [2,3,4,5]   #Tue, Wed, Thu, Fri
dayGroups = [dayGroup1, dayGroup2, dayGroup3]
groupDivs = [vcat(dayDiv[g]...) for g in dayGroups]

default(legend = :topleft)
labels = ["Daily" "7 Day Average"]
p1 = plot(1:indexLast, [values(tsA) values(tsB)], c=[:blue :red], label = labels,
    xlabel = "Day", ylabel = "Daily Users", ylim = (0,200))

p2 = plot(indexes2020, [values(tsA)[indexes2020] values(tsB)[indexes2020]],
    c=[:blue :red], label=labels, xlabel="Day in 2020",
    ylabel="Daily Users", ylim=(0,200))

p3 = plot(1:indexLast, values(tsC), label = labels, c=:black,
    xlabel="Day", ylabel="Variation", ylim=(-50,50), legend=false)

p4 = plot(indexes2020, values(tsC)[indexes2020], label = labels, c=:black,
    xlabel="Day in 2020", ylabel="Variation", ylim=(-50,50), legend=false)

dayNames = dayname.(timestamp(tsA)[4:10])
p5 = density(dayDiv, label = hcat(dayNames...), legend = :topright,
    xlabel = "Variation", ylabel = "Frequency")

dayGroupNames = ["Sun", "Mon+Sat", "Tue+Wed+Thu+Fri"]
p6 = density(groupDivs, label = hcat(dayGroupNames...), legend = :topright,
    xlabel = "Variation", ylabel = "Frequency", c=[:blue :red :green])





plot(p1, p2, p3, p4, p5, p6, layout = (3,2), margin = 5mm)
Listing 8.21: Using linear regression for forecasting in a time series


using CSV, DataFrames, Dates, GLM, Statistics, LinearAlgebra, Measures, Plots; pyplot()

df = CSV.read("../data/oneOnEpsilonBlogs.csv",copycols = true)
len = size(df)[1]

dow = dayofweek.(Date.(df.Day,Dates.DateFormat("m/d/y")))
dayGroups = [[7], [1,6], [2,3,4,5]]
inds = [[in(d,grp) for d in dow] for grp in dayGroups]
df2 = DataFrame(Time=1:len, Users=df.Users,
                Group1=inds[1], Group2=inds[2])

trainRange1, futureRange1 = 100:130, 130:180
trainRange2, futureRange2 = 200:300, 300:320
trainRange3, futureRange3 = 400:500, 500:600
trainRange4, futureRange4 = 560:600, 600:630

function forecast(trainRange, futureRange)
    model = lm(@formula(Users ~ Time + Group1 + Group2), df2[trainRange,:])
    pred(time) = dot(coef(model), [1, time, inds[1][time], inds[2][time]])
    model, pred.(trainRange), pred.(futureRange)
end

function forecastPlot(train, future)
    p = plot(train[1], df.Users[train[1]], label = "Observed Users",
        xlabel = "Day", ylabel = "Daily Users", c=:blue)
    plot!(train[1], train[2], label = "Train", c=:red)
    plot!(future[1], future[2], label = "Forecast", c=:magenta)
    plot!(future[1], df.Users[future[1]], label = "Actual Users", c=:green)
    return p
end

model1, train1, fcst1 = forecast(trainRange1, futureRange1)
println(model1)

default(legend = :topleft)
p1 = forecastPlot((trainRange1, train1), (futureRange1, fcst1))

_, train2, fcst2 = forecast(trainRange2, futureRange2)
default(legend = false)
p2 = forecastPlot((trainRange2, train2), (futureRange2, fcst2))

_, train3, fcst3 = forecast(trainRange3, futureRange3)
p3 = forecastPlot((trainRange3, train3), (futureRange3, fcst3))

_, train4, fcst4 = forecast(trainRange4, futureRange4)
p4 = forecastPlot((trainRange4, train4), (futureRange4, fcst4))
plot(p1, p2, p3, p4, size = (900,600), margin = 5mm)
Listing 8.22: Differencing, autocorrelation and a correlogram of a time series


using CSV, TimeSeries, Dates, Statistics, StatsBase, Measures, Plots; pyplot()

df = CSV.read("../data/oneOnEpsilonBlogs.csv",copycols = true)
tsA = TimeArray(Date.(df.Day,Dates.DateFormat("m/d/y")),df.Users)
tsB = moving(mean,tsA,7,padding = true)
tsC = TimeArray(timestamp(tsA), values(tsA) - values(tsB))

dow = dayofweek.(timestamp(tsA))
dayDiv = [filter((x)->!isnan(x), values(tsC)[dow .== d]) for d in 1:7]
dayMeans = mean.(dayDiv)
dailyCorrection = dayMeans[dow]
errs = filter((x)->!isnan(x), values(tsC) - dailyCorrection)
diffs = diff(errs)

lags = 1:50
acc = autocor(diffs,lags)

default(legend = false)
p1 = plot(diffs, c=:blue,
    xlabel="Day", ylabel="Difference Between Corrected Days")
p2 = plot(lags, acc, line=:stem, c=:blue,
    xlabel="Lag", ylabel="Autocorrelation")
plot(p1, p2, size=(900,400), margin = 5mm)
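
To see what the correlogram measures, the sample autocorrelation can also be computed by hand. The following is a minimal sketch, assuming a hypothetical short series (not the blog data) and the usual normalization by the lag-zero sum of squares. The function name sampleAutocor is ours, introduced only for illustration.

```julia
using Statistics

# Sample autocorrelation at lag k: demean, then compare the series with a shifted copy
function sampleAutocor(x, k)
    m = mean(x)
    num = sum((x[t] - m) * (x[t+k] - m) for t in 1:length(x)-k)
    den = sum((xt - m)^2 for xt in x)
    return num / den
end

series = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0]   # hypothetical series with a trend
diffs = diff(series)                                  # first differences remove the trend

for k in 1:3
    println("lag ", k, ": ", round(sampleAutocor(diffs, k), digits = 3))
end
```

Here the differences alternate between 2 and -1, so the autocorrelation is negative at odd lags and positive at even lags. This is the type of periodic pattern that a correlogram makes visible.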











Chapter 9 


Machine Learning Basics - DRAFT 


The previous chapters covered some of the key concepts of classical statistics. These include 
point estimation, confidence intervals, hypothesis testing, and regression analysis. With such tasks, 
the focus is often on the model, its properties, and the interpretation of inference. However, paradigms 
arising in computer science introduce additional data-focused tasks, some of which are related to 
classical statistical tasks and others of which are new. These tasks, together with methods that are 
generally considered to fall within the realm of machine learning, are the focus of this chapter. 


There is not a clear boundary between statistics and machine learning and hence there are 
many competing terms. Some of these include statistical learning often employed in the statistics 
community, as well as the related terms data mining and even artificial intelligence. These terms are 
used in computer science and business contexts. Further, one key paradigm that has emerged within 
machine learning in the past decade is deep learning, an area which deals with creating machine 
learning models that involve multi-layer neural networks. Such methods often yield models that 
integrate well into other automatic systems including apps for both personal and business use. 


In this chapter we present an overview of several machine learning methods ranging from classical 
methods to deep learning. We clearly cannot cover everything that exists in the machine learning 
world because such content will require multiple chapters or even multiple books. Our purpose 
here is merely to present the reader with a taste of the associated problems and methods from the 
field. We hope that by considering the 20 code listings of this chapter together with the associated 
descriptions, the reader may begin to get a taste for the nature of machine learning. For further 
reading dealing with some theoretical aspects of machine learning, mixed with (Python) code we 
recommend and references there-in. Further, a very popular reference focusing almost 
entirely on deep learning is [GBC16]. A recent Julia centric book discussing some aspects of machine 
learning is [V20]. Finally a book which deals with both classical statistics, computational methods 
in statistics, and machine learning is [EH16]. 


Julia has dozens of packages related to machine learning and here we only use a few. The 
main deep learning library is Flux.jl and indeed this chapter makes quite a bit of use of the 
Flux package. Further, there are frameworks such as MLJ.jl (Machine Learning Julia) and an 
adaptation of Python's scikit-learn via the ScikitLearn.jl Julia package. We don't use MLJ or 
ScikitLearn per se, but recommend that the reader investigate them independently. 


Machine learning problems, tasks, and activities can be divided into several categories. 
Some of the major categories include: 


Supervised learning - This suite of problems deals with a situation similar to the regression 
problems of Chapter 8. Data is available in the form of (x_i, y_i) and the goal is to learn how to 
predict Y based on X. In this sense, the regression analysis methods of Chapter 8 already 
serve as a basis for machine learning. However, in machine learning each x_i is often very high 
dimensional. The elements of x_i are called features and the variable y_i is referred to as the 
label. When the labels are only 0 or 1, or come from a finite discrete set, the supervised 
learning problem is called a classification problem. In contrast, when the labels are 
continuous, as in most of Chapter 8, this is called a regression problem. 


Unsupervised learning - In this case there are only features X but no labels Y. That is, there 
isn’t any label marking X. Think for example of a baby that learns about the world without 
receiving explicit feedback and direction. In basic machine learning, one important task is 
creating clusters of points and recognizing the clusters. This falls in the realm of clustering 
algorithms. Another task is reducing the dimension of the data, i.e. data reduction. 


Reinforcement learning - In this case an agent makes decisions dynamically over time, aiming to 
maximize some objective or achieve desired behavior. In certain cases the agent has some 
knowledge about the way the world responds to the decisions, but this knowledge is often 
lacking or very partial. The methods of reinforcement learning make it possible to control such 
systems in a near optimal manner. Notable examples include playing games such as chess or Go. 
Other examples include playing an unknown video game against a computer and eventually 
improving. In practice there are often applications in robotics related to reinforcement 
learning. See for example the (now classic) website http://heli.stanford.edu/ 


Generative models - This is the task of observing data similar to the unsupervised case and creating 
a model that can then create additional similar data. One general application of these types of 
models is deep fake technology, where images and movies can be modified to look different 
yet appear natural. For example, one face can be implanted on another. Such technology is 
not necessarily positive and has been put to some negative use in recent years. Nevertheless, 
understanding the basics of how it works is important. 
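
To make the notions of features and labels concrete, the following is a minimal sketch of a supervised classification task. It assumes a hypothetical toy dataset with two-dimensional features and labels in {0, 1}. The nearest-centroid rule used here is chosen only for its simplicity and is not one of the methods presented later in this chapter.

```julia
using Statistics

# Hypothetical toy data (an assumption): each column of X is a feature vector x_i
X = [1.0 1.2 0.8 3.0 3.2 2.9;
     1.0 0.9 1.1 3.1 2.8 3.0]
y = [0, 0, 0, 1, 1, 1]              # the label y_i of each column

# "Training": store the mean feature vector (centroid) of each class
labelsList = sort(unique(y))
centroids = Dict(l => vec(mean(X[:, y .== l], dims = 2)) for l in labelsList)

# Prediction: assign the label of the closest centroid
classify(x) = labelsList[argmin([sum(abs2, x - centroids[l]) for l in labelsList])]

println("Prediction for [0.9, 1.0]: ", classify([0.9, 1.0]))
println("Prediction for [3.1, 3.0]: ", classify([3.1, 3.0]))
```

If the labels were removed and only X was available, then grouping the columns of X into two clusters would be an unsupervised learning task.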


This chapter overviews specific methods for each of the above tasks with simple examples. We 
mainly use a very basic classic dataset, MNIST. We begin with Section 9.1 where we introduce 
some basic concepts of machine learning, mostly related to supervised classification but also dealing 
with stochastic optimization techniques that are common in machine learning. In Section 9.2 we 
present several concrete methods for supervised classification. These include basic least squares, 
logistic regression, support vector machines, random forests, and deep learning. We continue with 
Section 9.3 where we present the concept of regularization, focusing on ridge regression optimized via 
cross validation and dropout in deep learning. In Section 9.4 we explore some unsupervised learning 
techniques including clustering and Principal Component Analysis (PCA). Then in Section 9.5 we 
explore the basics of Markov decision processes and reinforcement learning. We close with Section 9.6 
where we briefly demonstrate how to train and use Generative Adversarial Networks (GANs). 


9.1 Training, Testing and Tricks of the Trade 


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 9.1: Using a pre-trained neural network for classification


using Flux, Flux.Data.MNIST, Statistics, BSON, StatsBase, Plots; pyplot()
using Flux: onecold

model = Chain(Conv((5,5), 1=>8, relu), MaxPool((2,2)),
              Conv((3,3), 8=>16, relu), MaxPool((2,2)),
              flatten, Dense(400, 10), softmax)

BSON.@load "../data/mnistConv.bson" modelParams
Flux.loadparams!(model, modelParams)

# Place a single image in a width x height x channels x number array
function predictor(img)
    whcn = ones(Float32,28,28,1,1)
    whcn[:,:,1,1] = Float32.(img)
    onecold(model(whcn), 0:9)[1]
end

testLabels = Flux.Data.MNIST.labels(:test)
testImages = Flux.Data.MNIST.images(:test)
nTest = length(testLabels)

iC, iR = 0, 0
nCorrect = 0
goodExamples = zeros(Int,10)
badExamples = zeros(Int,10)
predictedBad = zeros(Int,10)
for i in 1:nTest
    prediction = predictor(testImages[i])
    trueLabel = testLabels[i]
    predictionIsCorrect = (prediction == trueLabel)
    global nCorrect += predictionIsCorrect
    global iC; global iR
    if predictionIsCorrect && trueLabel == iC
        goodExamples[iC+1] = i
        iC += 1
    end
    if !predictionIsCorrect && trueLabel == iR
        badExamples[iR+1] = i
        predictedBad[iR+1] = prediction
        iR += 1
    end
end

println("Percentage correctly classified: ", 100*nCorrect/nTest)

default(yflip = true, size = (1000,300),
    legend=false, color = :Greys, ticks=false)
p1 = heatmap(hcat(float.(testImages[goodExamples])...))
p2 = heatmap(hcat(float.(testImages[badExamples])...))
for i in 1:10
    annotate!(28i-3, 25, text("$(predictedBad[i])", 18))
end
plot(p1, p2, layout = (2,1))
Listing 9.2: Attempting hand crafted machine learning


using MLDatasets, StatsBase, Measures, Plots; pyplot()

xTrain, yTrain = MLDatasets.MNIST.traindata(Float32)
xTest, yTest = MLDatasets.MNIST.testdata(Float32)
nTrain, nTest = size(xTrain)[3], size(xTest)[3]
trainData = [xTrain[:,:,k]' for k in 1:nTrain]
testData = [xTest[:,:,k]' for k in 1:nTest]
positiveTrain = trainData[yTrain .== 1]
negativeTrain = trainData[yTrain .!= 1]
testLabels = yTest .== 1

# Proportion of the image intensity lying near the peak of each row
function peakProp(img)
    peakSum = 0.0
    for j in 1:28
        m = argmax(img[j,:])
        (m <= 2 || m >= 26) && continue
        peakSum += sum(img[j,m-2:m+2])
    end
    peakSum/sum(img)
end

predict(img,theta) = peakProp(img) <= theta ? false : true

function F1value(theta)
    predictionOnPositive = predict.(positiveTrain, theta)
    predictionOnNegative = predict.(negativeTrain, theta)
    TP = sum(predictionOnPositive)
    FN = sum(1 .- predictionOnPositive)
    FP = sum(predictionOnNegative)
    TN = sum(1 .- predictionOnNegative)
    recall, precision = TP/(TP + FN), TP/(TP + FP)
    return 2*(precision*recall)/(precision+recall)
end

psPositive, psNegative = peakProp.(positiveTrain), peakProp.(negativeTrain)
thetaRange = 0.5:0.005:1
f1Values = F1value.(thetaRange)

bestF1, bestIndex = findmax(f1Values)
bestTheta = thetaRange[bestIndex]
println("Best theta = ", bestTheta, " with F1 value of ", bestF1)

println("On test set:")
testPredictions = predict.(testData,bestTheta)
TP, FN = sum(testPredictions[testLabels]), sum(.!testPredictions[testLabels])
FP, TN = sum(testPredictions[.!testLabels]), sum(.!testPredictions[.!testLabels])
recall, precision = TP/(TP + FN), TP/(TP + FP)
F1test = harmmean([precision, recall])
@show TP, FN, FP, TN; @show recall, precision; @show F1test

p1 = stephist(psPositive, normed = true, label="1 Digit", bins=50)
stephist!(psNegative, normed = true, xlim=(0.4,1), ylim=(0,6), bins=50,
    xlabel = "Value", ylabel = "Frequency",
    label="Non 1 Digit")
plot!([bestTheta,bestTheta], [0,5], c=:black, label = :none)
p2 = plot(thetaRange, f1Values, legend = false,
    xlabel = "Threshold", ylabel = "F1 Value")
plot!([bestTheta], [bestF1], c=:black)
plot(p1, p2, size=(800, 400))
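
The precision, recall, and F1 computations used in the listing above can be isolated in a few lines. The following is a minimal sketch, assuming hypothetical boolean prediction and label vectors. The function name f1Score and the variable names prec and rec are ours, introduced only for illustration.

```julia
# Precision, recall and F1 from boolean predictions and boolean true labels
function f1Score(predictions, labels)
    TP = sum(predictions .& labels)        # predicted positive, truly positive
    FP = sum(predictions .& .!labels)      # predicted positive, truly negative
    FN = sum(.!predictions .& labels)      # predicted negative, truly positive
    prec = TP / (TP + FP)
    rec = TP / (TP + FN)
    return prec, rec, 2 * prec * rec / (prec + rec)
end

# Hypothetical example with 4 positives, of which 3 are detected, and 1 false alarm
labels      = [true, true, true, false, false, false, false, true]
predictions = [true, true, false, true, false, false, false, true]

prec, rec, f1 = f1Score(predictions, labels)
println("precision = ", prec, ", recall = ", rec, ", F1 = ", f1)
```

The F1 value is the harmonic mean of precision and recall, which is why harmmean from StatsBase is used on the test set in the listing. Note that if no positive predictions are made then the precision is 0/0 and the resulting F1 value is NaN.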
Listing 9.3: Using Flux.jl and ADAM for optimization


using Flux, Random, LinearAlgebra, CSV
using Flux.Optimise: update!
Random.seed!(0)

data = CSV.read("../data/L1L2data.csv")
xVals, yVals = Array{Float64}(data.X), Array{Float64}(data.Y)

eta = 0.05
epsilon = 10^-7

b = rand(2)

predict(x) = b[1] .+ b[2]*x
loss(x,y) = sum((y .- predict(x)).^2)
opt = ADAM(eta)

iter, gradNorm = 0, 1.0
while gradNorm >= epsilon
    gs = gradient(()->loss(xVals,yVals), params(b))
    update!(opt,b,gs[b])
    global gradNorm = norm(gs[b])
    global iter += 1
end

println("Number of iterations: ", iter)
println("Coefficients: ", b)
Listing 9.4: Using SGD for least squares


using Random, Distributions, Plots, Measures, LaTeXStrings; pyplot()
Random.seed!(1)

n = 10^3
beta0, beta1, sigma = 2.0, 1.5, 2.5
eta = 10^-3

xVals = rand(0:0.01:5, n)
yVals = beta0 .+ beta1*xVals + rand(Normal(0,sigma),n)

pts, b = [], [0, 0]
push!(pts,b)
for k in 1:10^4     # number of iterations is illegible in the source; 10^4 is assumed
    i = rand(1:n)
    # Gradient of the squared error of the single observation i
    g = [ 2(b[1] + b[2]*xVals[i] - yVals[i]),
          2*xVals[i]*(b[1] + b[2]*xVals[i] - yVals[i]) ]
    global b -= eta*g
    push!(pts,b)
end

p1 = plot(first.(pts), last.(pts), c=:black, lw=0.5, label="SGD path")
scatter!([b[1]], [b[2]], c=:blue, ms=5, label="SGD")
scatter!([beta0], [beta1],
    c=:red, ms=5, label="Actual",
    xlabel=L"\beta_0", ylabel=L"\beta_1",
    ratio=:equal, xlims=(0,2.5), ylims=(0,2.5))

p2 = scatter(xVals, yVals, c=:black, ms=1, label="Data points")
plot!([0,5], [b[1],b[1]+5*b[2]], c=:blue, label="SGD")
plot!([0,5], [beta0,beta0+5*beta1], c=:red, label="Actual",
    xlims=(0,5), ylims=(-5,15), xlabel = "x", ylabel = "y")

plot(p1, p2, legend=:topleft, size=(800, 400), margin = 5mm)


9.2 Supervised Learning Methods 


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 9.5: Linear least squares classification


using Flux, Flux.Data.MNIST, LinearAlgebra
using Flux: onehotbatch

# Each row of trainData and testData is one image flattened to 784 values
imgs = Flux.Data.MNIST.images()
labels = Flux.Data.MNIST.labels()
nTrain = length(imgs)
trainData = vcat([hcat(float.(imgs[i])...) for i in 1:nTrain]...)
trainLabels = labels[1:nTrain]

testImgs = Flux.Data.MNIST.images(:test)
testLabels = Flux.Data.MNIST.labels(:test)
nTest = length(testImgs)
testData = vcat([hcat(float.(testImgs[i])...) for i in 1:nTest]...)

A = [ones(nTrain) trainData]
Adag = pinv(A)
tfPM(x) = x ? +1 : -1
yDat(k) = tfPM.(onehotbatch(trainLabels,0:9)'[:,k+1])
bets = [Adag*yDat(k) for k in 0:9]

classify(input) = findmax([([1 ; input])'*bets[k] for k in 1:10])[2]-1

predictions = [classify(testData[k,:]) for k in 1:nTest]
confusionMatrix = [sum((predictions .== i) .& (testLabels .== j))
                    for i in 0:9, j in 0:9]
accuracy = sum(diag(confusionMatrix))/nTest

println("Accuracy: ", accuracy, "\nConfusion Matrix:")
show(stdout, "text/plain", confusionMatrix)
Listing 9.6: Logistic softmax regression for classification


using Flux, Flux.Data.MNIST, Statistics, BSON, Random, StatsBase, Plots; pyplot()
using Flux: params, onehotbatch, crossentropy, update!
Random.seed!(0)

nTrain = 20000
miniBatchSize = 1000
imgs = Flux.Data.MNIST.images()[1:nTrain]
labels = Flux.Data.MNIST.labels()[1:nTrain]
trainData = hcat([vcat(float.(imgs[i])...) for i in 1:nTrain]...)
trainLabels = labels

testImgs = Flux.Data.MNIST.images(:test)
testLabels = Flux.Data.MNIST.labels(:test)
nTest = length(testImgs)
testData = hcat([vcat(float.(testImgs[i])...) for i in 1:nTest]...)

W = randn(10,28*28)
b = randn(10)

logisticM(imgVec) = softmax(W*imgVec .+ b)
logisticMclassifier(imgVec) = argmax(logisticM(imgVec))-1
loss(x,y) = crossentropy(logisticM(x),onehotbatch(y,0:9))
opt = ADAM(0.01)

lossValue = 0.0
lossArray = []
epochNum = 0
while true
    global lossValue
    prevLossValue = lossValue
    for batch in Iterators.partition(1:nTrain,miniBatchSize)
        gs = gradient(()->loss(trainData[:,batch],trainLabels[batch]),params(W,b))
        for p in (W,b)
            update!(opt,p,gs[p])
        end
    end
    global epochNum += 1
    lossValue = loss(trainData,trainLabels)
    push!(lossArray,lossValue)
    print(".")
    abs(prevLossValue-lossValue) < 5e-4 && break
end

println("\nNumber of epochs: ", epochNum)
accuracy = mean([logisticMclassifier(testData[:,k]) for k in 1:nTest]
                    .== testLabels)
println("\nAccuracy: ", accuracy)

plot(lossArray, xlabel = "Epoch", ylabel = "Cross Entropy Loss", legend = false)







Listing 9.7: Support vector machines 


using Flux.Data.MNIST, LIBSVM, Plots

logFilePath = "../data/svmlog.txt"
nTrain = 10^4

trainImgs = MNIST.images()[1:nTrain]
trainLabels = MNIST.labels()[1:nTrain]
trainData = hcat([vcat(float.(trainImgs[i])...) for i in 1:nTrain]...)

testImgs = MNIST.images(:test)
testLabels = MNIST.labels(:test)
nTest = length(testImgs)
testData = hcat([vcat(float.(testImgs[i])...) for i in 1:nTest]...)

@info "Training model with verbose output to $logFilePath."
@time begin
    # Redirect the solver's verbose output to the log file while training
    sOut = stdout
    logF = open(logFilePath, "w")
    redirect_stdout(logF)
    model = svmtrain(trainData, trainLabels,
                    kernel = Kernel.Linear, verbose=true)
    close(logF)
    redirect_stdout(sOut)
    @info "Training complete."
end

predicted_labels, _ = svmpredict(model, testData)

accuracy = sum(predicted_labels .== testLabels)/nTest
println("Prediction accuracy (measured on test set of size $nTest): ", accuracy)
Listing 9.8: Random forest

using Flux.Data.MNIST, DecisionTree, Random
Random.seed!(0)

trainImgs = MNIST.images()
trainLabels = MNIST.labels()
nTrain = length(trainImgs)
# Each image becomes one row of length 784
trainData = vcat([hcat(float.(trainImgs[i])...) for i in 1:nTrain]...)

testImgs = MNIST.images(:test)
testLabels = MNIST.labels(:test)
nTest = length(testImgs)
testData = vcat([hcat(float.(testImgs[i])...) for i in 1:nTest]...)





numFeaturesPerTree = 10 
numTrees = 40 
portionSamplesPerTree = 


maxTreeDepth = 10 





model = build_forest(trainLabels, trainData,
                    numFeaturesPerTree, numTrees,
                    portionSamplesPerTree, maxTreeDepth)
println("Trained model:")
println(model)

predicted_labels = [apply_forest(model, testData[k,:]) for k in 1:nTest]
accuracy = sum(predicted_labels .== testLabels)/nTest
println("\nPrediction accuracy (measured on test set of size $nTest): ",accuracy)
Listing 9.9: Training dense and convolutional neural networks

using Flux, Flux.Data.MNIST, Statistics, BSON, Random, Plots; pyplot()
using Flux: onehotbatch, onecold, crossentropy
Random.seed!(0)

epochs = 30
eta = 5e-3
batchSize = 1000
trainRange, validateRange = 1:5000, 5001:10000

# Pack the selected images into a 28x28x1xN tensor with one-hot labels
function minibatch(x, y, indexRange)
    xBatch = Array{Float32}(undef, size(x[1])..., 1, length(indexRange))
    for i in 1:length(indexRange)
        xBatch[:, :, :, i] = Float32.(x[indexRange[i]])
    end
    return (xBatch, onehotbatch(y[indexRange], 0:9))
end

trainLabels = MNIST.labels()[trainRange]
trainImgs = MNIST.images()[trainRange]
mbIdxs = Iterators.partition(1:length(trainImgs), batchSize)
trainSet = [minibatch(trainImgs, trainLabels, bi) for bi in mbIdxs]

validateLabels = MNIST.labels()[validateRange]
validateImgs = MNIST.images()[validateRange]
validateSet = minibatch(validateImgs, validateLabels, 1:length(validateImgs))

model1 = Chain(flatten, Dense(784, 200, relu), Dense(200, 100, tanh),
               Dense(100, 10, sigmoid), softmax)

model2 = Chain(Conv((5, 5), 1=>8, relu), MaxPool((2,2)),
               Conv((3, 3), 8=>16, relu), MaxPool((2,2)),
               flatten, Dense(400, 10), softmax)

opt1 = ADAM(eta); opt2 = ADAM(eta)
accuracyPaths = [[],[]]
accuracy(x, y, model) = mean(onecold(model(x)) .== onecold(y))
loss(x, y, model) = crossentropy(model(x), y)
cbF1() = push!(accuracyPaths[1], accuracy(validateSet..., model1))
cbF2() = push!(accuracyPaths[2], accuracy(validateSet..., model2))

model1(trainSet[1][1]); model2(trainSet[1][1])
for _ in 1:epochs
    Flux.train!((x,y)->loss(x,y,model1), params(model1), trainSet, opt1, cb=cbF1)
    Flux.train!((x,y)->loss(x,y,model2), params(model2), trainSet, opt2, cb=cbF2)
    print(".")
end

println("\nModel1 (Dense) accuracy = ", accuracy(validateSet..., model1))
println("Model2 (Convolutional) accuracy = ", accuracy(validateSet..., model2))
cd(@__DIR__)
BSON.@save "../data/mnistConv.bson" modelParams=cpu.(params(model2))
plot(accuracyPaths, label = ["Dense" "Convolutional"],
    ylim=(0.7,1.0), xlabel="Batch number", ylabel = "Validation Accuracy")
9.3 Bias, Variance and Regularization


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 9.10: Ridge regression with k-fold cross validation

using RDatasets, DataFrames, Random, Statistics, LinearAlgebra
using MultivariateStats, LaTeXStrings, Plots; pyplot()
Random.seed!(0)

df = dataset("MASS", "cpus")
n = size(df)[1]
df = df[shuffle(1:n),:]

# K folds of nG observations each; leftover observations are dropped
K = 10
nG = Int(floor(n/K))
n = K*nG
println("Losing $(size(df)[1] - n) observations.")

lamGrid = 0:100:30000

devSet(k) = collect(1+nG*(k-1):nG*k)
trainSet(k) = setdiff(1:n,devSet(k))

xTrain(k) = convert(Array{Float64,2},df[trainSet(k),[:Cach, :ChMin]])
xDev(k) = convert(Array{Float64,2},df[devSet(k),[:Cach, :ChMin]])
yTrain(k) = convert(Array{Float64,1},df[trainSet(k),:Perf])
yDev(k) = convert(Array{Float64,1},df[devSet(k),:Perf])

errVals = zeros(length(lamGrid))
for (i,lam) in enumerate(lamGrid)
    errSamples = zeros(K)
    for k in 1:K
        beta = ridge(xTrain(k),yTrain(k),lam)
        errSamples[k] = norm([ones(nG) xDev(k)]*beta - yDev(k))^2
    end
    errVals[i] = sqrt(mean(errSamples))
end

i = argmin(errVals)
bestLambda = lamGrid[i]

betaFinal = ridge(convert(Array{Float64,2},df[:,[:Cach, :ChMin]]),
                    convert(Array{Float64,1},df[:,:Perf]),bestLambda)

macro RR(x) return :(round.($x, digits = 3)) end
println("Found best lambda for regularization: ", bestLambda)
println("Beta estimate: ", @RR betaFinal)

plot(lamGrid, errVals, legend = false,
    xlabel = L"\lambda", ylabel = "Loss")
plot!([bestLambda,bestLambda],[0,10^3], c = :black, ylim = (750, 1250))
Listing 9.11: Tuning the dropout probability

using Flux, Flux.Data.MNIST, Statistics, BSON, Random, StatsPlots; pyplot()
using Flux: onehotbatch, onecold, crossentropy, @epochs

epochs = 30
eta = 1e-3
batchSize = 200
trainRange, validateRange = 1:1000, 1001:5000

function minibatch(x, y, idxs)
    xBatch = Array{Float32}(undef, size(x[1])..., 1, length(idxs))
    for i in 1:length(idxs)
        xBatch[:, :, :, i] = Float32.(x[idxs[i]])
    end
    return (xBatch, onehotbatch(y[idxs], 0:9))
end

trainLabels = MNIST.labels()[trainRange]
trainImgs = MNIST.images()[trainRange]
mbIdxs = Iterators.partition(1:length(trainImgs), batchSize)
trainSet = [minibatch(trainImgs, trainLabels, i) for i in mbIdxs]

validateLabels = MNIST.labels()[validateRange]
validateImgs = MNIST.images()[validateRange]
validateSet = minibatch(validateImgs, validateLabels, 1:length(validateImgs))

accuracy(x, y, model) = mean(onecold(model(x)) .== onecold(y))
loss(x, y, model) = crossentropy(model(x), y)

# Train a fresh network with dropout probability dropP and return validation accuracy
function evalAccuracy(dropP)
    model = Chain(Conv((5, 5), 1=>8, relu), MaxPool((2,2)),
                Conv((3, 3), 8=>16, relu), MaxPool((2,2)),
                flatten,
                Dense(400, 200, relu), Dropout(dropP),
                Dense(200, 200, relu), Dropout(dropP),
                Dense(200, 200, relu), Dropout(dropP),
                Dense(200, 10, relu), Dropout(dropP),
                softmax)
    opt = ADAM(eta)
    @epochs epochs Flux.train!((x,y)->loss(x,y,model), params(model), trainSet, opt)
    accuracy(validateSet..., model)
end

pToTest = [0.0, 0.25, 0.5, 0.75]
n = 10
results = [[evalAccuracy(p) for _ in 1:n] for p in pToTest]
bestAcc, bestI = findmax(median.(results))
println("The best dropout probability is $(pToTest[bestI]).")
println("It achieves $(bestAcc) accuracy on average.")

boxplot(results, label="",
c Uu Cpe les), 
xlabel="Dropout Probability", ylabel = "Accuracy", legend = false) 
9.4 Unsupervised Learning Methods


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 
Listing 9.12: Carrying out k-means via the Clustering package

using Clustering, RDatasets, Random, Measures, Plots; pyplot()
Random.seed!(0)

K = 3
df = dataset("cluster", "xclara")
data = copy(convert(Array{Float64}, df)')

seeds = initseeds(:rand, data, K)
xclaraKmeans = kmeans(data, K, init = seeds)

println("Number of clusters: ", nclusters(xclaraKmeans))
println("Counts of clusters: ", counts(xclaraKmeans))

df.Group = assignments(xclaraKmeans)

p1 = scatter(df[:, :V1], df[:, :V2], c=:blue, msw=0)
     scatter!(df[seeds, :V1], df[seeds, :V2], markersize=12, c=:red, msw=0)

p2 = scatter(df[df.Group .== 1, :V1], df[df.Group .== 1, :V2], c=:blue, msw=0)
     scatter!(df[df.Group .== 2, :V1], df[df.Group .== 2, :V2], c=:red, msw=0)
     scatter!(df[df.Group .== 3, :V1], df[df.Group .== 3, :V2], c=:green, msw=0)





plot (p1,p2, legend=:none, ratio=:equal, 
size=(800,400), xlabel="V1", ylabel="V2", margin 
Listing 9.13: Manual implementation of k-means

using RDatasets, Distributions, Random, LinearAlgebra
Random.seed!(0)

K = 3
df = dataset("cluster", "xclara")
n,_ = size(df)
dataPoints = [convert(Array{Float64,1},df[i,:]) for i in 1:n]
shuffle!(dataPoints)

xMin, xMax = minimum(first.(dataPoints)), maximum(first.(dataPoints))
yMin, yMax = minimum(last.(dataPoints)), maximum(last.(dataPoints))

means = [[rand(Uniform(xMin,xMax)),rand(Uniform(yMin,yMax))] for _ in 1:K]
labels = rand(1:K,n)
prevMeans = -means

# Alternate between assigning points to the nearest mean and recomputing the means
while norm(prevMeans - means) > 0.001
    prevMeans = means
    labels = [argmin([norm(means[i]-x) for i in 1:K]) for x in dataPoints]
    means = [sum(dataPoints[labels .== i])/sum(labels .==i) for i in 1:K]
end

cnts = [sum(labels .== i) for i in 1:K]
println("Counts of clusters (manual implementation): ", cnts)
Listing 9.14: Carrying out hierarchical clustering

using RDatasets, Clustering, Random, LinearAlgebra, Plots; pyplot()
Random.seed!(0)

df = dataset("cluster", "xclara")
n,_ = size(df)
dataPoints = [convert(Array{Float64,1},df[i,:]) for i in 1:n]
shuffle!(dataPoints)
# Pairwise Euclidean distance matrix
D = [norm(pt1 - pt2) for pt1 in dataPoints, pt2 in dataPoints]

result = hclust(D)
for K in 1:30
    clusters = cutree(result,k=K)
    println("K=$(K): ",[sum(clusters .== i) for i in 1:K])
end

cluster(ell,K) = (1:n)[cutree(result,k=K) .== ell]
C1, C2, C3 = cluster(1,30), cluster(2,30), cluster(3,30)

plt = scatter(first.(dataPoints[C1]),last.(dataPoints[C1]),c=:blue, msw=0)
      scatter!(first.(dataPoints[C2]),last.(dataPoints[C2]), c=:red, msw=0)
      scatter!(first.(dataPoints[C3]),last.(dataPoints[C3]), c=:green, msw=0)
for ell in 4:30
    clst = cluster(ell,30)
    scatter!(first.(dataPoints[clst]),last.(dataPoints[clst]),
        ms=10, c=:purple, shape=:xcross, ratio=:equal, legend=:none,
        xlabel="V1", ylabel="V2")
end
plot(plt)
Listing 9.15: Principal component analysis

using Statistics, MultivariateStats, RDatasets, LinearAlgebra, Plots; pyplot()

data = dataset("datasets", "iris")
data = data[:,[:SepalLength,:SepalWidth,:PetalLength,:PetalWidth]]
x = convert(Array{Float64,2},data)'

model = fit(PCA, x, maxoutdim=4, pratio = 0.999)
M = projection(model)

# Eigenvectors of the sample covariance, ordered by decreasing eigenvalue
function manualProjection(x)
    covMat = cov(x')
    ev = eigvals(covMat)
    eigOrder = sortperm(eigvals(covMat), rev=true)
    eigvecs(covMat)[:, eigOrder]
end

println("Manual vs. package: ",maximum(abs.(M-manualProjection(x))))

pcVar = principalvars(model) ./ tvar(model)
cumVar = cumsum(pcVar)
pcDat = M[:,1:2]'*x

p1 = plot(pcVar, c=:blue, label="Variance due to PC")
     plot!(1:length(cumVar), cumVar, label="Cumulative Variance", c=:red,
        xlabel="Principal component", ylabel="Variance Proportion", ylims=(0,1))
p2 = scatter(pcDat[1,:], pcDat[2,:], c=:blue, xlabel="PC 1", ylabel="PC 2",
        msw=0, legend=:none)
plot(p1, p2, size=(800, 400))








Listing 9.16: Principal component analysis on MNIST

using MultivariateStats, RDatasets, LinearAlgebra, Flux.Data.MNIST, Measures, Plots
pyplot()

imgs, labels = MNIST.images(), MNIST.labels()
x = hcat([vcat(float.(im)...) for im in imgs]...)
pca = fit(PCA, x; maxoutdim=2)
M = projection(pca)

# Project the images of digits dA and dB onto the first two principal components
function compareDigits(dA, dB)
    imA, imB = imgs[labels .== dA], imgs[labels .== dB]
    xA = hcat([vcat(float.(im)...) for im in imA]...)
    xB = hcat([vcat(float.(im)...) for im in imB]...)
    zA, zB = M'*xA, M'*xB
    default(ms=0.8, msw=0, xlims=(-5,12.5), ylims=(-7.5,7.5),
            legend = :topright, xlabel="PC 1", ylabel="PC 2")
    scatter(zA[1,:], zA[2,:], c=:red, label="Digit $(dA)")
    scatter!(zB[1,:], zB[2,:], c=:blue, label="Digit $(dB)")
end

plots = []
for k in 1:5
    push!(plots, compareDigits(2k-2,2k-1))
end
plot(plots..., size = (800, 500), margin = 5mm)
9.5 Reinforcement Learning and MDP


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 9.17: Value iteration for an MDP


using LinearAlgebra, LaTeXStrings, Plots; pyplot()


L = 10
p0, pl = 
beta = 0. 
epsilon = 0.001 


function valueIteration(kappa)
    # Transition matrices on states 1..L under action 0 and action 1
    P0 = diagm(1=>fill(p0,L-1)) + diagm(-1=>fill(1-p0,L-1))
    P0[1,1], P0[L,L] = 1 - p0, p0

    P1 = diagm(1=>fill(p1,L-1)) + diagm(-1=>fill(1-p1,L-1))
    P1[1,1], P1[L,L] = 1 - p1, p1

    R0 = collect(1:L)
    R1 = R0 .- kappa

    bellmanOperator(Vprev) =
        max.(R0 + beta*P0*Vprev, R1 + beta*P1*Vprev)
    optimalPolicy(V, state) =
        (R0+beta*P0*V)[state] >= (R1+beta*P1*V)[state] ? 0 : 1

    V, Vprev = fill(0,L), fill(1,L)
    while norm(V-Vprev) > epsilon
        Vprev = V
        V = bellmanOperator(Vprev)
    end

    return [optimalPolicy(V,s) for s in 1:L]
end

kappaGrid = 0:0.1:2.0
policyMap = zeros(L, length(kappaGrid))

for (i,kappa) in enumerate(kappaGrid)
    policyMap[:,i] = valueIteration(kappa)
end
heatmap(policyMap, fill=cgrad([:blue, :red]),
    xticks=(0:1:21, -0.1:0.1:2.0), yticks=(0:L, 0:L),
    xlabel=L"\kappa", ylabel="State", colorbar_entry=false)
Listing 9.18: A Q-Learning example

using LinearAlgebra, StatsBase, Random, LaTeXStrings, Plots; pyplot()
Random.seed!(0)

L = 10
p0, p1 = 1/2, 3/4
beta = 0.75
pExplore(t) = t^-0.2
alpha(t) = t^-0.2
T = 10^5


function QlearnSim(kappa)
    P0 = diagm(1=>fill(p0,L-1)) + diagm(-1=>fill(1-p0,L-1))
    P0[1,1], P0[L,L] = 1 - p0, p0

    P1 = diagm(1=>fill(p1,L-1)) + diagm(-1=>fill(1-p1,L-1))
    P1[1,1], P1[L,L] = 1 - p1, p1

    R0 = collect(1:L)
    R1 = R0 .- kappa

    nextState(s,a) =
        a == 0 ? sample(1:L,weights(P0[s,:])) : sample(1:L,weights(P1[s,:]))

    Q = zeros(L,2)
    s = 1
    optimalAction(s) = Q[s,1] >= Q[s,2] ? 0 : 1
    for t in 1:T
        # Explore with probability pExplore(t), otherwise act greedily
        if rand() < pExplore(t)
            a = rand([0,1])
        else
            a = optimalAction(s)
        end
        sNew = nextState(s,a)
        r = a == 0 ? R0[sNew] : R1[sNew]
        Q[s,a+1] = (1-alpha(t))*Q[s,a+1] + alpha(t)*(r+beta*max(Q[sNew,1],Q[sNew,2]))
        s = sNew
    end
    [optimalAction(s) for s in 1:L]
end

kappaGrid = 0.0:0.1:2.0
policyMap = zeros(L, length(kappaGrid))

for (i,kappa) in enumerate(kappaGrid)
    policyMap[:,i] = QlearnSim(kappa)
end


heatmap(policyMap, fill=cgrad([:blue, :red]),
    xticks=(0:1:21, -0.1:0.1:2.0), yticks=(0:L, 0:L),
    xlabel=L"\kappa", ylabel="State", colorbar_entry=false)







9.6 Generative Adversarial Networks 


WITH THE EXCEPTION OF CODE 
THE CONTENT OF THIS SECTION IS OMITTED IN THE DRAFT 


Listing 9.19: Generating images from a pre-trained generative adversarial network


using Flux, BSON, Random, Plots; pyplot()
Random.seed!(0)


latentDim = 100
outputX, outputY = 6, 3


gen = Chain(Dense(latentDim, 7*7*256), BatchNorm(7*7*256,relu),
    x->reshape(x,7,7,256,:), ConvTranspose((5,5),256=>128;stride=1,pad=2),
    BatchNorm(128,relu), ConvTranspose((4,4),128=>64;stride=2,pad=1),
    BatchNorm(64,relu), ConvTranspose((4,4),64=>1,tanh;stride=2,pad=1))

cd(@__DIR__)
BSON.@load "../data/mnistGAN40.bson" genParams
Flux.loadparams!(gen, genParams)

fixedNoise = [randn(latentDim, 1) for _ in 1:outputX*outputY]
fakeImages = @. gen(fixedNoise)
# Tile the generated 28x28 images into one large array for plotting
imageArray = permutedims(dropdims(reduce(vcat,
    reduce.(hcat, Iterators.partition(fakeImages, outputY)));
    dims=(3, 4)), (2, 1))

heatmap(imageArray, yflip = true, color = :Greys,
    size = (300,150), legend=false, ticks=false)
Listing 9.20: Training a generative adversarial network

using Flux, MLDatasets, Statistics, Random, BSON
using Flux.Optimise: update!
using Flux: logitbinarycrossentropy

batchSize, latentDim = 500, 100
epochs = 40
etaD, etaG = 0.0002, 0.0002

images, _ = MLDatasets.MNIST.traindata(Float32)
# Rescale pixel values from [0,1] to [-1,1] to match the tanh output of the generator
imageTensor = reshape(@.(2f0 * images - 1f0), 28, 28, 1, :)
data = [imageTensor[:, :, :, r] for r in Iterators.partition(1:60000, batchSize)]

dscr = Chain(Conv((4,4),1=>64;stride=2,pad=1), x->leakyrelu.(x,0.2f0),
    Dropout(0.25), Conv((4,4),64=>128;stride=2,pad=1), x->leakyrelu.(x,0.2f0),
    Dropout(0.25), x->reshape(x, 7 * 7 * 128, :), Dense(7 * 7 * 128, 1))
gen = Chain(Dense(latentDim, 7*7*256), BatchNorm(7*7*256,relu),
    x->reshape(x,7,7,256,:), ConvTranspose((5,5),256=>128;stride=1,pad=2),
    BatchNorm(128,relu), ConvTranspose((4,4),128=>64;stride=2,pad=1),
    BatchNorm(64,relu), ConvTranspose((4,4),64=>1,tanh;stride=2,pad=1))

dLoss(realOut,fakeOut) = mean(logitbinarycrossentropy.(realOut,1f0)) +
                            mean(logitbinarycrossentropy.(fakeOut,0f0))
gLoss(u) = mean(logitbinarycrossentropy.(u, 1f0))

function updateD!(gen, dscr, x, opt_dscr)
    noise = randn!(similar(x, (latentDim, batchSize)))
    fakeInput = gen(noise)
    ps = Flux.params(dscr)
    loss, back = Flux.pullback(()->dLoss(dscr(x), dscr(fakeInput)), ps)
    grad = back(1f0)
    update!(opt_dscr, ps, grad)
    return loss
end

function updateG!(gen, dscr, x, optGen)
    noise = randn!(similar(x, (latentDim, batchSize)))
    ps = Flux.params(gen)
    loss, back = Flux.pullback(()->gLoss(dscr(gen(noise))), ps)
    grad = back(1f0)
    update!(optGen, ps, grad)
    return loss
end

optDscr, optGen = ADAM(etaD), ADAM(etaG)
cd(@__DIR__)
@time begin
    for ep in 1:epochs
        for (bi,x) in enumerate(data)
            lossD = updateD!(gen, dscr, x, optDscr)
            lossG = updateG!(gen, dscr, x, optGen)
            @info "Epoch $ep, batch $bi, D loss = $(lossD), G loss = $(lossG)"
        end
        @info "Saving generator for epoch $ep"
        BSON.@save "../data/mnistGAN$(ep).bson" genParams=cpu.(params(gen))
    end
end








Chapter 10 


Simulation of Dynamic Models - DRAFT 


Most of the statistical methods presented in the previous chapters deal with inherently static 
data. With the exception of a few time series examples, there is rarely a time component involved, 
and typically observed random variables or vectors are assumed independent. We now move on to 
a different setting that involves a time component and/or dependent random variables. In general, 
such models are called “dynamic” as they describe change over time or space. A consequence of 
dynamic behavior is dependence between random variables at different points in time or space. 


Our focus in this chapter is not on statistical inference for such models, but rather on model 
construction, simulation, analysis, and control. Understanding the basics that we present here can 
help readers understand more complex systems and examples from applied probability, stochastic 
operations research, and methods of stochastic control such as reinforcement learning already covered 
in Section 9.5 of Chapter 9. Dynamic stochastic models are a vast and exciting area. Here we only
touch the tip of the iceberg. 


A basic paradigm is as follows: in discrete time t = 0,1,2,..., one way to describe a random 
dynamical system is via the recursion, 
X(t+1) = f(X(t), ξ(t)), (10.1)


where X(t) is the state of the system at time t, ξ(t) is some random perturbation noise and f(·,·)
is a function that yields the next state as a function of the current state and the noise component. 
Continuous time and other generalizations also exist. Simulation of such a dynamic model then 
refers to the act of using Monte Carlo to generate trajectories, 


X(0), X(1), X(2),..., 
for the purpose of evaluating performance and deciding on good control methods. 
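

The recursion (10.1) translates almost directly into code. The following is a minimal sketch, not a listing from this book: the helper simulate() and the example dynamics fExample and noiseExample are hypothetical names chosen for illustration, and the particular choice of a damped scalar system with standard normal noise is an assumption.

```julia
using Random

# Generic simulator for X(t+1) = f(X(t), xi(t)).
# `f` maps (state, noise) to the next state; `noise` draws one xi(t) from `rng`.
function simulate(f, noise, x0, T; rng = Random.default_rng())
    traj = Vector{typeof(x0)}(undef, T + 1)
    traj[1] = x0                      # traj[t+1] stores X(t)
    for t in 1:T
        traj[t+1] = f(traj[t], noise(rng))
    end
    return traj
end

# Hypothetical example: a damped scalar system with additive Gaussian noise
fExample(x, xi) = 0.9x + xi
noiseExample(rng) = randn(rng)

traj = simulate(fExample, noiseExample, 5.0, 100; rng = MersenneTwister(0))
println("X(0) = ", traj[1], ", X(100) = ", traj[end])
```

Passing the random number generator explicitly makes a trajectory reproducible, which is convenient when different control rules are to be compared on the same noise sequence.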


Our focus is on a few elementary cases. In Section 10.1 we consider deterministic dynamical
systems. We also present the very topical SEIR epidemic model as it received much attention in
the era of COVID-19. In Section 10.2 we discuss simulation of Markov Chains both in discrete and
continuous time. In Section 10.3 we discuss discrete event simulation, which is a general method
for simulating processes that are subject to changes over discrete time points. In Section 10.4
we discuss models with additive noise and present a simple case of the Kalman filter. Then in
Section 10.5 we briefly discuss network reliability and touch on elementary examples from reliability
theory. We close with a discussion of common random numbers in Section 10.6. This Monte Carlo
implementation strategy has been used in quite a few examples throughout our book, and our
purpose here is to understand it a bit better.


Figure 10.1: Trajectory of the predator prey model of (10.4) and (10.5) for
X_1 = 0.8 (prey), X_2 = 0.05 (predator), and parameters a, c, d = 2, 1, 5. Given
these values the system converges to the equilibrium point in red.

10.1 Deterministic Dynamical Systems 


Before we consider systems such as (10.1), we first consider systems without a noise component.
In discrete time these can be described via the difference equation


X(t+1) = f(X(t)), (10.2)


and in continuous time via the Ordinary Differential Equation (ODE),


\frac{d}{dt} X(t) = f(X(t)). (10.3)


These are generally called dynamical systems as they describe the evolution of the “dynamic” state
X(t) over time. Many physical, biological and social systems may be modeled in this way, and a
common objective is to obtain the trajectory of the system over time, given an initial state X(0).


In the case of a difference equation this is straightforward via recursion of equation (10.2). In
continuous time we use ODE solution techniques to find the solutions of (10.3).


Discrete Time


The state X(t) can take different forms. In some cases it is a scalar, in other cases a vector,
and yet in other cases it is an element from an arbitrary set. As a first example, assume that it
is a two dimensional vector representing normalized quantities of animals living in a competitive
environment. Here X_1(t) is the number of “prey” animals and X_2(t) is the number of “predators”.
The species then affect each other via natural growth, natural mortality, and hunting of the prey
by the predators.


One very common model for such a population is the predator prey model, described by the
Lotka-Volterra equations:


X_1(t+1) = a X_1(t)(1 - X_1(t)) - X_1(t) X_2(t), (10.4)
X_2(t+1) = -c X_2(t) + d X_1(t) X_2(t). (10.5)


Here a, c and d are positive constants that parameterize the evolution of this system. For
parameter values in a certain range, there exists an equilibrium point. For example if a = 2, c = 1
and d = 5 an equilibrium point is obtained via,


X^* = (X_1^*, X_2^*) = \left( \frac{1+c}{d}, \frac{d(a-1) - a(1+c)}{d} \right) = (0.4, 0.2). (10.6)


To see that this is an equilibrium point, observe that using X* for both X(t) and X(t + 1) in
(10.4) and (10.5) satisfies the equations. Hence, according to the model, once the predator and
prey populations reach this point they will never move away from it. This is the definition of an 
equilibrium point. 
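

The fixed point property can also be checked numerically. The short sketch below applies one step of the recursion (10.4) and (10.5) to the point given by (10.6) and confirms that the state does not move; the name xStar is introduced here only for illustration.

```julia
a, c, d = 2, 1, 5

# One step of the recursion (10.4)-(10.5)
next(x, y) = [a*x*(1 - x) - x*y, -c*y + d*x*y]

# Equilibrium point from the closed form formula (10.6)
xStar = [(1 + c)/d, (d*(a - 1) - a*(1 + c))/d]

println("X*       = ", xStar)
println("next(X*) = ", next(xStar...))
println("Fixed point (up to rounding): ", all(isapprox.(next(xStar...), xStar)))
```

The comparison uses isapprox() rather than == because floating point arithmetic may introduce rounding in the last digits.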


Listing 10.1 simulates the trajectory of the predator prey model by carrying out straightforward
iteration over (10.4) and (10.5) given an initial state, and specific values of a, c, and d. The trajectory
can be seen in Figure 10.1 along with the equilibrium point.


Listing 10.1: Trajectory of a predator prey model

using Plots, LaTeXStrings; pyplot()

a, c, d = 2, 1, 5
next(x,y) = [a*x*(1-x) - x*y, -c*y + d*x*y]
equibPoint = [(1+c)/d, (d*(a-1)-a*(1+c))/d]

initX = [0.8, 0.05]
tEnd = 100

traj = [[] for _ in 1:tEnd]
traj[1] = initX

for t in 2:tEnd
    traj[t] = next(traj[t-1]...)
end

scatter([traj[1][1]], [traj[1][2]],
    c=:black, ms=10,
    label="Initial state")
plot!(first.(traj), last.(traj),
    c=:blue, ls=:dash, m=(:dot, 5, Plots.stroke(0)),
    label="Model trajectory")
scatter!([equibPoint[1]], [equibPoint[2]],
    c=:red, shape=:cross, ms=10, label="Equilibrium point",
    xlabel=L"X_1", ylabel=L"X_2")
In line 4 we define the function next() that implements the recursion of (10.4) and (10.5). In line 5
the equilibrium point is calculated via the closed form formula in (10.6). The initial state of the system
is set in line 7, and the total number of discrete time points to iterate over is set in line 8. In line 10
we pre-allocate an array of arrays of length tEnd, where each sub-array is an array of two elements
representing values of X_1 and X_2 respectively. The first element of the array is then initialized in
line 11. Lines 13-15 loop over the time horizon and the next() function is applied at each time to
obtain the state evolution. Note the use of the splat operator ... in line 14 for transforming the two
elements of traj[t-1] into distinct input arguments to next(). The remainder of the code plots
Figure 10.1.
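

For readers new to the splat operator, the following small sketch isolates its behavior. The function addScaled() and the vectors v and w are hypothetical and serve only to illustrate the syntax.

```julia
# A two-argument function, in the style of next() in the listing
addScaled(x, y) = x + 10y

v = [1, 2]

# addScaled(v) would fail, since a single vector argument is passed instead of two.
# Splatting unpacks the elements of v into separate arguments:
r1 = addScaled(v...)
r2 = addScaled(v[1], v[2])      # equivalent explicit form
println(r1 == r2)

# Splatting also works inside array literals
w = [0, v..., 3]
println(w)
```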





Continuous Time 


We now look at the continuous time case through a physical example. Consider a block of mass 
M which rests on a flat surface. A spring horizontally connects the block to a nearby wall. The 
block is then horizontally displaced a distance z from its equilibrium position and then released. 
Figure [10.2] illustrates this scenario. The question is then how to describe the state of this system 
over time. 


For this example we first make several assumptions. We assume that the spring operates elastically,
and therefore the force generated by the spring on the block is given by


F_s = -kz,


where k is the spring constant of the particular spring, and z is the displacement of the spring from
its equilibrium position. Note that the force acts in the opposite direction of the displacement. In
addition, we assume that dry friction exists between the block and the surface it rests on, therefore
the frictional force is given by


F_f = -bV,


where b is the coefficient of friction between the block and the surface, and V is the velocity of the
block. Again note that the frictional force acts in the opposite direction of the velocity, as it
resists motion.


With these established we can now describe the system. Let X_1(t) denote the location of the
mass and X_2(t) the velocity of the mass. Using basic dynamics, these can then be described via,


\begin{bmatrix} \dot{X}_1(t) \\ \dot{X}_2(t) \end{bmatrix} = A \begin{bmatrix} X_1(t) \\ X_2(t) \end{bmatrix}, \quad \text{where} \quad A = \begin{bmatrix} 0 & 1 \\ -\frac{k}{M} & -\frac{b}{M} \end{bmatrix}. (10.7)


The first equation of (10.7) simply indicates that X_2(t) is the derivative of X_1(t) (the notation
of a ‘dot’ over a variable denotes the derivative). The second equation can be read as,


M \dot{X}_2(t) = F_s + F_f. (10.8)


Here the right hand side is the sum of the forces described above and the left hand side is “mass
multiplied by acceleration”. Equation (10.8) arises from the basic laws of Newtonian physics or
classical mechanics. With such an ODE (sometimes called a linear system of ODEs), it turns out
that given initial conditions X(0), a solution to this ODE is,


X(t) = e^{At} X(0), (10.9)


where e^{At} is a matrix exponential. Hence using the matrix exponential is one way of obtaining
solutions to the trajectory of X(t).


Figure 10.2: Spring and mass system, with spring force F_s, friction force F_f
and applied displacement z.




Many other alternative methods are implemented in Julia’s DifferentialEquations package.
We use both approaches in Listing 10.2 where we compute the evolution of this system given
a starting velocity of zero, and a displacement of 8 units to the right of the equilibrium point. The
changing state of the system is shown in the resulting Figure 10.3.





Listing 10.2: Trajectory of a spring and mass system

using DifferentialEquations, LinearAlgebra, Plots; pyplot()


A = [0 1;
     -k/M -b/M]

initX = [8.0, 0.0]
tEnd = 50.0
tRange = 0:0.1:tEnd

manualSol = [exp(A*t)*initX for t in tRange]

linearRHS(x,Amat,t) = Amat*x
prob = ODEProblem(linearRHS, initX, (0,tEnd), A)
sol = solve(prob)

p1 = plot(first.(manualSol), last.(manualSol),
    c=:blue, label="Manual trajectory")
p1 = scatter!(first.(sol.u), last.(sol.u),
    c=:red, ms = 5, msw=0, label="DiffEq package")
p1 = scatter!([initX[1]], [initX[2]],
    c=:black, ms=10, label="Initial state", xlims=(-7,9), ylims=(-9,7),
    ratio=:equal, xlabel="Displacement", ylabel="Velocity")
p2 = plot(tRange, first.(manualSol),
    c=:blue, label="Manual trajectory")
p2 = scatter!(sol.t, first.(sol.u),
    c=:red, ms = 5, msw=0, label="DiffEq package")
p2 = scatter!([0], [initX[1]],
    c=:black, ms=10, label="Initial state", xlabel="Time",
    ylabel="Displacement")
plot(p1, p2, size=(800,400), legend=:topright)
Figure 10.3: Trajectory of a spring and mass system.





In line 3 we set the values for the spring constant k, the friction constant b and the mass M. In
lines 4-5 the matrix A is defined as in (10.7). In line 7 the initial conditions of the system are set,
with the mass displaced 8 units to the right of the equilibrium point and the velocity set to zero.
In line 11 we compute the trajectory of the system via the brute-force approach of (10.9). Here we
use exp() from the LinearAlgebra package to evaluate the matrix exponential in (10.9). The
resulting array manualSol is an array of two dimensional arrays (state vectors), one for each point
in time in tRange. In lines 13-15 the DifferentialEquations package is used to solve the ODE.
In line 13 a function which is the right hand side of the ODE of (10.7) is defined. Line 14 defines an
ODEProblem object as prob. This object is defined by the right hand side function linearRHS, the
initial condition initX, a tuple of a time horizon (0,tEnd), and a parameter to pass to the right
hand side function, A. Finally line 15 uses solve() from the DifferentialEquations package
to obtain a numerical solution of the ODE. The remaining code generates Figure 10.3 which shows
the manual solution of the trajectory in blue, and discrete points along the trajectory obtained by
solve() of DifferentialEquations in red. Observe that in line 19, sol.u is used to get an
array of the trajectory of state from the ODE solution. Similarly, in line 26 sol.t is used to get the
time points matching sol.u. 
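

To give a feeling for what a numerical ODE method does, the following sketch compares the matrix exponential solution (10.9) with a hand-written, fixed-step, fourth-order Runge-Kutta scheme. This is an illustration only and is not a description of what solve() does internally. The function rk4() is a hypothetical helper, and the values of k, b and M below are assumed for the example rather than taken from the listing.

```julia
using LinearAlgebra

# Assumed illustrative parameter values (not taken from the listing)
k, b, M = 1.0, 0.5, 2.0
A = [0 1;
     -k/M -b/M]
x0 = [8.0, 0.0]          # displaced 8 units, zero initial velocity

# Classical fourth-order Runge-Kutta for x' = A*x with fixed step h
function rk4(A, x0, h, nSteps)
    x = copy(x0)
    for _ in 1:nSteps
        k1 = A*x
        k2 = A*(x + h/2*k1)
        k3 = A*(x + h/2*k2)
        k4 = A*(x + h*k3)
        x = x + h/6*(k1 + 2k2 + 2k3 + k4)
    end
    return x
end

tFinal, h = 10.0, 0.01
xExact = exp(A*tFinal)*x0                       # closed form (10.9)
xRK4 = rk4(A, x0, h, round(Int, tFinal/h))      # numerical approximation
println("Matrix exponential: ", xExact)
println("RK4:                ", xRK4)
println("Difference norm:    ", norm(xExact - xRK4))
```

For a linear system the matrix exponential is available in closed form, so it serves as a reference against which a numerical scheme can be checked. For nonlinear right hand sides no such formula exists in general, and numerical methods are the practical route.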




















The SEIR Epidemic Model 


As far as things appear in mid 2020, the COVID-19 pandemic is a major historical event affecting 
human life, societies, and economies. With such an event, dynamic mathematical models are playing 
a major role in aiding policy makers for prediction and analysis. Often the models employed are 
quite complex, yet a basic deterministic dynamical system that is often used as a first step is the 
SIR (Susceptible-Infected-Removed) model as well as the slightly more detailed SEIR (Susceptible- 
Exposed-Infected-Removed) model. These types of models have existed since the 1920’s [KM1927],
when they were developed after the major Spanish Influenza epidemic of 1918-1920. While there
are more advanced epidemiological models in mathematical biology, understanding SIR and SEIR
is often a first step for quantification of epidemics as well as understanding phenomena such as
flattening the curve and herd immunity.
The deterministic dynamical system versions of SIR and SEIR involve a state X(t) that is
composed of three elements in the case of SIR and four elements in the case of SEIR. Each element
is sometimes called a compartment and hence these models are called compartmental models. A
large finite population is assumed to be distributed among the compartments susceptible, exposed,
infected, and removed. That is, each individual is assumed to be in one of these compartments. In
SIR the exposed compartment is not present.


Susceptible individuals are those that are not yet ill and can potentially become ill if in contact
with infected individuals. Exposed individuals are those that have already been in contact with
infected individuals, however their infection is still incubating and they cannot yet infect others.
Infected individuals are those that are ill and can also infect others. The removed compartment
is sometimes called recovered (although unfortunately in the case of COVID-19, the former term
is more suitable because some removed individuals die). In any case, these models assume that
recovered individuals have full immunity, as those that are removed/recovered do not affect the
epidemic further.


At any time, the counts of individuals in each of the four compartments (three in the case of
SIR) are given via S(t), E(t), I(t), and R(t). However, as this is a differential equation model, these
counts are generally not integers. A population of size M is assumed and hence

$$S(t) + E(t) + I(t) + R(t) = M.$$

Hence practically, it is sufficient to describe the system state via only three coordinates (two in the
case of SIR because E(t) = 0).


The model can be parameterized in several ways and here we choose a representation based
on the non-negative rates, $\beta$, $\gamma$, and $\delta$. Practically if the time unit is taken as days, then
$\beta^{-1}$ can be considered as the mean number of days between contacts of individuals and hence
$\beta$ is the contact rate. The value of $\gamma^{-1}$ can be considered as the mean disease duration and
hence $\gamma$ is the recovery rate. The value of $\delta^{-1}$ can be considered as the mean incubation period
of the disease during which an individual is exposed to the virus but is still not infecting others.
Hence we call $\delta$ the de-incubation rate.


As is apparent at the time of writing this book, a disease such as COVID-19 incubates for about
5 days and hence $\delta = 1/5$. The mean disease duration is about 10 days and hence $\gamma = 0.1$. Finally,
in a society without special social distancing, we take $\beta = 0.25$. There is no strong justification
for the magnitude of this value, however one way to motivate it is to consider the basic reproduction
number, $R_0$. This elusive quantity is central to the study and discussion of epidemics and constitutes
the mean number of individuals that an infected individual infects at onset of the epidemic. For
COVID-19 without special social distancing measures, a commonly assumed value for $R_0$ is 2.5. Now
for SEIR (and SIR) models it can be shown that,


$$R_0 = \frac{\beta}{\gamma} = \frac{\gamma^{-1}}{\beta^{-1}} = \frac{\text{Mean disease duration}}{\text{Mean time between contacts}}.$$


Hence $\beta = 0.25$ yields the desired $R_0 = 2.5$ when $\gamma = 0.1$. We should mention that if one was to
try and fit an SEIR model to real data, then tuning the $\beta$ parameter is generally a difficult task.
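The arithmetic relating the rates to the reproduction number can be checked directly. The following is a minimal sketch using the parameter values quoted above; the helper betaForR0() is a hypothetical name introduced here for illustration and is not part of any package.

```julia
# Rates in units of 1/day, as quoted in the text.
beta, delta, gamma = 0.25, 1/5, 0.1

meanContactGap = 1/beta     # mean number of days between contacts
meanIncubation = 1/delta    # mean incubation period in days
meanDuration   = 1/gamma    # mean disease duration in days

# Basic reproduction number: beta/gamma, equivalently meanDuration/meanContactGap.
R0 = beta/gamma

# Hypothetical helper: the contact rate needed to reach a target R0 for a given gamma.
betaForR0(targetR0, gamma) = targetR0*gamma

println("R0 = ", R0)
println("Mean incubation period (days) = ", meanIncubation)
println("beta needed for R0 = 1.5: ", betaForR0(1.5, gamma))
```

Lowering the target reproduction number lowers the required contact rate proportionally, which is the effect social distancing aims for.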
Figure 10.4: A trajectory of a deterministic SEIR model.


With the parameters and state of the model in place we can now present the model as a
system of differential equations:


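$$
\begin{aligned}
\frac{d}{dt}S(t) &= -\beta \tfrac{1}{M} S(t) I(t) \\
\frac{d}{dt}E(t) &= \beta \tfrac{1}{M} S(t) I(t) - \delta E(t) \\
\frac{d}{dt}I(t) &= \delta E(t) - \gamma I(t) \\
\frac{d}{dt}R(t) &= \gamma I(t)
\end{aligned}
\tag{10.10}
$$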


As can be observed from these equations there are three types of transitions between
compartments: S → E, E → I, and I → R. The latter two occur at rates proportional to the number of
individuals in the source compartment, $\delta E(t)$ and $\gamma I(t)$. However the S → E transition is slightly
more involved. The main driving force of infection is interaction between individuals and this is
assumed to follow the general law of mass action. The idea is that if at time t there are S(t)
susceptible individuals and I(t) infected individuals then new infections will occur at a rate
proportional to the product S(t)I(t). This is because each individual in S(t) comes into contact with
infected individuals at a rate proportional to I(t).
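To make the three transitions concrete, the following is a minimal forward-Euler sketch of the system (10.10) with M = 1, so that the state holds proportions. It uses no packages; the function name eulerSEIR and the step size dt are illustrative choices made here, and a fixed-step Euler scheme is a cruder method than a dedicated ODE solver.

```julia
# Forward-Euler integration of the SEIR equations with M = 1 (proportions).
# eulerSEIR and dt are illustrative choices, not part of any package.
function eulerSEIR(beta, delta, gamma, x0, tEnd; dt = 0.01)
    s, e, i, r = x0
    for _ in 1:round(Int, tEnd/dt)
        newExposed  = beta*s*i     # S -> E flow, law of mass action
        newInfected = delta*e      # E -> I flow
        newRemoved  = gamma*i      # I -> R flow
        s -= dt*newExposed
        e += dt*(newExposed - newInfected)
        i += dt*(newInfected - newRemoved)
        r += dt*newRemoved
    end
    return [s, e, i, r]
end

xEnd = eulerSEIR(0.25, 0.2, 0.1, [0.975, 0.0, 0.025, 0.0], 100.0)
println("State at t = 100: ", xEnd)
println("Total population: ", sum(xEnd))
```

Each flow leaves one compartment and enters the next, so the total stays at M throughout the run, up to round-off.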


Listing 10.3 generates a trajectory of the SEIR model with the aforementioned parameters, which
appears in Figure 10.4. Here we assume that at onset 2.5% of the population is infected. In
this case the final proportion of individuals that were infected and removed ends up at about 86%.
Also of interest is the height of the red infection curve. Many of the social distancing measures
imposed in 2020 to combat COVID-19 were imposed with a view of reducing $\beta$ and hence reducing
the height of the infection curve as well as the final proportion of infected individuals. You may
try to modify $\beta$ in the code and see the effect on $R_0$, the final proportion of infected, and the
shape of the infected curve. Note that in the code we take M = 1 to obtain proportions.
Listing 10.3: Trajectory of a deterministic SEIR epidemic





using DifferentialEquations, Plots; pyplot()

beta, delta, gamma = 0.25, 0.2, 0.1    # contact, de-incubation and recovery rates
initialInfect = 0.025                  # initial proportion infected, I(0)
println("R0 = ", beta/gamma)

initX = [1-initialInfect, 0.0, initialInfect, 0.0]    # [S, E, I, R] at time 0
tEnd = 100.0

RHS(x,parms,t) = [  -beta*x[1]*x[3],
                    beta*x[1]*x[3] - delta*x[2],
                    delta*x[2] - gamma*x[3],
                    gamma*x[3] ]

prob = ODEProblem(RHS, initX, (0,tEnd), 0)
sol = solve(prob)
println("Final infected proportion= ", sol.u[end][4])

plot(sol.t,((x)->x[1]).(sol.u),label = "Susceptible", c=:green)
plot!(sol.t,((x)->x[2]).(sol.u),label = "Exposed", c=:blue)
plot!(sol.t,((x)->x[3]).(sol.u),label = "Infected", c=:red)
plot!(sol.t,((x)->x[4]).(sol.u),label = "Removed", c=:yellow,
    xlabel = "Time", ylabel = "Proportion", legend = :top)

















R0 = 2.5
Final infected proportion= 0.862203941883436 





The parameters of SEIR are set in line 3 and the initial proportion of infected I(0) is set in line 4. These
parameters agree with an $R_0$ value similar to what is believed for COVID-19. The initial state of the
system is set in line 7 and a maximal duration in line 8. Lines 10-13 implement the right hand side
of the SEIR system of ODEs from (10.10). The ODE is set up in line 15 and is solved in line 16. The
final proportion of infected is printed in line 17 and the remainder of the code creates Figure 10.4.





10.2 Markov Chains 


In the previous section we considered systems that evolve deterministically. However sometimes 
it is more natural and applicable to model systems and assume that they have a built-in stochastic
component. We now introduce and explore one such broad class of models called Markov chains. 
We first consider discrete time models and then move onto continuous time. 


With a rich enough state space, many natural phenomena can be described via Markov chain
models. Further, in certain cases such models are artificially constructed as an aid for computation.
We saw such a use in Markov chain Monte Carlo (MCMC) in Section 5.7 and also briefly considered
simulation of a simple discrete time Markov chain in Listing 1.8. We now dive into
further details.


The basic model evolution introduced in the previous section followed X(t+1) = f(X(t))
where X(t) is the state. That is, the next state is a direct deterministic function of the current
state. Markov chains behave similarly, however in the case of a Markov chain X(t+1) depends on
X(t) probabilistically. That is, the next state X(t+1) is drawn randomly, based on a probability
distribution that depends on the value of X(t). For this, the model specification is typically based
on a probability transition law,

$$p_{i,j} := \mathbb{P}\big(X(t+1) = j \mid X(t) = i\big) \quad \text{for all states } i, j. \tag{10.11}$$


Here $p_{i,j}$ specifies the probability of transitioning from a current state $i$ to a next state $j$. For
every $i$,

$$\sum_{j} p_{i,j} = 1,$$

and hence the sequence $(p_{i,1}, p_{i,2}, \ldots)$ specifies a probability distribution. The actual state space
where $i$ and $j$ take values can vary depending on context. If the state space is countable, then the
transition probabilities for all $i$ and $j$ describe the Markov chain. Furthermore, if the state space is
finite, then the probabilities may be organized in a transition probability matrix, $P = [p_{i,j}]$, where
each row specifies a probability distribution (or probability vector). In other cases where the state
space is uncountable, it isn't possible to only consider events such as X(t+1) = j and therefore
the definition of (10.11) is varied slightly to allow $X(t+1) \in A$ for a rich collection of sets $A$. We
don't discuss such situations further here, as we assume that the state space is at most countable.


At the onset of this chapter in (10.1), we specified the equation $X(t+1) = f(X(t), \xi(t))$, where
$\xi(t)$ is some random perturbation noise. One may ask: How does the evolution of a Markov chain
fit this description? For this, assume that you are given the probabilities in (10.11). Now by setting
the random perturbation noise $\xi$ as a uniform [0,1] random variable, we are able to specify $f(x, \xi)$
as a function that evaluates the inverse CDF associated with the distribution $(p_{x,1}, p_{x,2}, \ldots)$ at the
point $\xi$. This ensures that the probabilities in (10.11) are adhered to based on the inverse probability
transform (see Section 3.4). For illustration, we implement such a function $f(\cdot,\cdot)$ in Listing 10.4
where we specify a transition probability matrix (see the function f1() in the listing).
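The inverse probability transform for a single step can be sketched in isolation as follows. This is a minimal illustration, assuming states labelled 0,...,n-1 and a row-stochastic matrix P; the names inverseCDFSample and nextState are introduced here and are not taken from the listing.

```julia
# Inverse probability transform for a discrete distribution p: return the first
# index whose cumulative probability reaches u, where u is a number in [0,1].
function inverseCDFSample(p, u)
    c = cumsum(p)
    idx = findfirst(c .>= u)
    return idx === nothing ? length(p) : idx    # guard against round-off in c[end]
end

# One step of a Markov chain on states 0,...,n-1 driven by the uniform value u.
nextState(P, x, u) = inverseCDFSample(P[x+1, :], u) - 1

P = [1/3 1/3 0   0   1/3;
     1/3 1/3 1/3 0   0;
     0   1/3 1/3 1/3 0;
     0   0   1/3 1/3 1/3;
     1/3 0   0   1/3 1/3]

println(nextState(P, 0, 0.2))     # u in the first third: stay in state 0
println(nextState(P, 0, 0.5))     # u in the middle third: move to state 1
println(nextState(P, 0, 0.9))     # u in the last third: wrap around to state 4
```

Passing rand() as u turns nextState() into a random transition with exactly the probabilities in row x of P.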





Alternatively, in certain cases it is more natural to first consider the stochastic recursive sequence
$X(t+1) = f(X(t), \xi(t))$ and to construct the associated transition probability matrix from it as
needed. For example, assume that $f(\cdot,\cdot)$ is specified as follows,

$$f(x,u) = x + u \mod 5, \tag{10.12}$$

for $x \in \{0,1,2,3,4\}$ and $u \in \{-1,0,+1\}$. This describes a situation where the state is decremented,
stays the same or incremented, all modulo 5, meaning that decrementing from 0 yields 4 and
incrementing from 4 yields 0. By using this $f(\cdot,\cdot)$ in (10.1), and assuming some probability law
for $\xi(t)$, we arrive at a stochastic model specifying random movement (with “wrap around”) on
$\{0,1,2,3,4\}$. It turns out that if we assume the noise component $\xi(t)$ is i.i.d., then such a stochastic
sequence may be encoded via a transition probability matrix, and that the model is a Markov chain
even though it wasn’t initially specified via P.


Figure 10.5: Estimates of the distribution of the time until all states in the
Markov chain are visited. The blue dots are generated using the transition
probability matrix, while the red dots are generated using a stochastic
recursive formula.

For example, say that $\xi(t)$ takes values $\{-1,0,+1\}$ uniformly. Then using (10.11), you may see
that the corresponding transition probability matrix is

$$
P = \begin{bmatrix}
1/3 & 1/3 & 0 & 0 & 1/3 \\
1/3 & 1/3 & 1/3 & 0 & 0 \\
0 & 1/3 & 1/3 & 1/3 & 0 \\
0 & 0 & 1/3 & 1/3 & 1/3 \\
1/3 & 0 & 0 & 1/3 & 1/3
\end{bmatrix}.
$$


Thus we see that the dynamics of a Markov chain can be described by either a transition probability 
matrix, or by a stochastic recursive sequence as in (10.1). In both cases, if we specify the initial 
distribution P(X(0) = i), the evolution of the sequence of random variables, X(0), X(1), X(2), ...
is well defined. 
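The passage from a stochastic recursive sequence to a transition probability matrix can be carried out mechanically when the noise takes finitely many values. The sketch below does this for (10.12) with uniform noise on -1, 0, +1; the variable names are illustrative.

```julia
# Build the transition probability matrix implied by X(t+1) = f(X(t), xi(t))
# when xi(t) is i.i.d. on a finite set with known probabilities.
f(x, u) = mod(x + u, 5)

states = 0:4
noise  = [-1, 0, 1]
probs  = [1/3, 1/3, 1/3]         # uniform noise, as in the example

P = zeros(5, 5)
for x in states, (u, q) in zip(noise, probs)
    P[x+1, f(x, u)+1] += q       # matrix indices are shifted by 1
end

display(P)
```

Each row accumulates the probability of every noise value that leads to the same next state, so rows sum to 1 by construction.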


Given the Markov chain sequence $\{X(t)\}_{t=0}^{\infty}$, we are sometimes interested in its limiting
statistical behavior, and at other times we use this sequence to construct another random variable
and are interested in the distribution of this variable, or just in its mean. As an example, for the
Markov chain described above, let $\tau$ be the minimal time such that all states have been visited:

$$\tau = \inf\{t \,:\, \exists\, t_0, t_1, t_2, t_3, t_4 \le t \text{ with } X(t_i) = i\}. \tag{10.13}$$


It is clear that $\tau$ is a random quantity because depending on the realization of $\{X(t)\}_{t=0}^{\infty}$, $\tau$ may
obtain different values. For example if we start with X(0) = 0 and then for the first 4 transitions
X(t) increases, then $\tau = 4$. However, it may also be that $\tau$ is a bigger number, for example if the
sequence of states happens to be 0,1,2,1,2,1,0,4,0,1,2,1,0,4,3,..., then $\tau = 14$ because that is
the first time where all states have been covered.
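The two examples above can be verified with a short function that scans a given path and reports the first time by which every state has appeared. This is a sketch for checking the definition (10.13) on a fixed path, with the hypothetical name coverTime; it is separate from the simulation code of the listing.

```julia
# Cover time of a given finite path over states 0,...,n-1: the first time index
# (with the first entry at time 0) by which every state has been visited.
# Returns nothing if the path never covers all states.
function coverTime(path, n)
    visited = falses(n)
    for (k, x) in enumerate(path)
        visited[x+1] = true
        all(visited) && return k - 1     # path[1] corresponds to t = 0
    end
    return nothing
end

println(coverTime([0,1,2,3,4], 5))                          # prints 4
println(coverTime([0,1,2,1,2,1,0,4,0,1,2,1,0,4,3], 5))      # prints 14
```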


In Listing 10.4 we illustrate both alternatives to generating a Markov chain. The function f1()
uses the transition probability matrix, and the function f2() implements (10.12) directly. For both
cases we assume that P(X(0) = 0) = 1, i.e. we start in state 0 with certainty. We then estimate
$E[\tau]$ and plot estimates of the distribution of $\tau$ in Figure 10.5. It can be observed from the output
that both methods are statistically identical. Note that it is possible to use first step analysis, a
concept that we don’t cover further here, to analytically show that $E[\tau] = 15$.
Listing 10.4: Two different ways of describing Markov chains


using LinearAlgebra, Statistics, StatsBase, Plots; pyplot()

n, N = 5, 10^6
P = diagm(-1 => fill(1/3,n-1),
           0 => fill(1/3,n),
           1 => fill(1/3,n-1))
P[1,n], P[n,1] = 1/3, 1/3

A = UpperTriangular(ones(n,n))
C = P*A

function f1(x,u)
    for xNew in 1:n
        if u <= C[x+1,xNew]
            return xNew-1
        end
    end
end

f2(x,xi) = mod(x + xi , n)

function countTau(f,rnd)
    t = 0
    visits = fill(false,n)
    state = 0
    while sum(visits) < n
        state = f(state,rnd())
        visits[state+1] |= true
        t += 1
    end
    return t-1
end

data1 = [countTau(f1,rand) for _ in 1:N]
data2 = [countTau(f2,()->rand([-1,0,1])) for _ in 1:N]
est1, est2 = mean(data1), mean(data2)
c1, c2 = counts(data1)/N, counts(data2)/N
println("Estimated mean value of tau using f1: ",est1)
println("Estimated mean value of tau using f2: ",est2)
println("The matrix P: "); display(P)

scatter(4:33,c1[1:30],
    c=:blue, ms=5, msw=0,
    label="Transition probability matrix")
scatter!(4:33,c2[1:30],
    c=:red, ms=5, msw=0, shape=:cross,
    label="Stochastic recursive formula", xlabel="Time", ylabel="Probability")







Estimated mean value of tau using f1: 15.0134
Estimated mean value of tau using f2: 15.00187
The matrix P:
5×5 Array{Float64,2}:
 0.333333  0.333333  0.0       0.0       0.333333
 0.333333  0.333333  0.333333  0.0       0.0
 0.0       0.333333  0.333333  0.333333  0.0
 0.0       0.0       0.333333  0.333333  0.333333
 0.333333  0.0       0.0       0.333333  0.333333
In line 3 we set n as the number of states and N as the number of simulation runs to carry out. Lines
4-7 construct the transition probability matrix P by using diagm() to fill the diagonals of the matrix,
and by assigning values to the north-east and south-west entries as well. In line 9 we construct an
upper triangular matrix of ones, A, and when P is multiplied by it on the right in line 10, we obtain a
matrix C whose rows are cumulative distribution vectors. Lines 12-18 implement the function f1(). It
assumes a uniform random variable u and returns a state using the inverse probability transform using
the matrix C. Note that x+1 in line 14 is because we treat the states as being 0,...,n-1 while the matrix
indices are shifted by 1. For the same reason, we subtract 1 in line 15. Line 20 implements the function
f2() as per (10.12). The function countTau() in lines 22-32 operates on two input arguments f and
rnd, each of which is assumed to be a function. It then iterates using the input arguments,
and as it does so checks for the condition defining $\tau$ in (10.13). Note that we can use it with both
types of $f(\cdot)$ functions, each with their respective type of random variable. The actual simulation
time step is in line 27 and then we use the ‘(self) logical or operator’, |= in line 28 to record a visit
to the current state. Here again, state+1 is due to the discrepancy between the state space and
array indexing. Lines 34 and 35 exhibit calls to countTau() where in line 34, the input argument
f1 is augmented with the system's rand function, and in line 35 we create an anonymous function,
()->rand([-1,0,1]) as a second input argument. Lines 40-44 produce Figure 10.5 along with
textual output showing that both methods estimate $E[\tau]$ similarly.

















A few more comments about discrete time Markov chains are in order. First, note that any
process, $\{X(t)\}_{t=0}^{\infty}$, that satisfies the property,

$$\mathbb{P}\big(X(t+1) = j \mid X(t) = i,\, X(t-1) = i_{-1},\, X(t-2) = i_{-2}, \ldots\big) = \mathbb{P}\big(X(t+1) = j \mid X(t) = i\big) \tag{10.14}$$

is called a Markov chain. This property indicates that given the current state (X(t) = i),
any previous states, $i_{-1}, i_{-2}, \ldots$ do not affect the evolution of the system. This is sometimes called
the memoryless property or Markov property. Furthermore, all of the Markov chains that we consider
in this chapter are time homogeneous. This property states that for any times $t_1$ and $t_2$,

$$\mathbb{P}\big(X(t_1+1) = j \mid X(t_1) = i\big) = \mathbb{P}\big(X(t_2+1) = j \mid X(t_2) = i\big).$$

If this were not the case, then the transition probability matrix, P, would not be sufficient for
describing the evolution of the Markov chain. Instead we would need a time-dependent family of
matrices, P(t). Also note that Markov chains possess a variety of elegant mathematical properties
that extend well beyond our examples here. See for an extensive introduction.


Further Discrete Time Modeling, Analysis and Simulation 


Modeling using Markov chains sometimes involves constructing the state space and the associ- 
ated transition probability matrix for a given scenario. In some cases this is straightforward, while 
in others some modeling insight is required. We now explore another example to illustrate this. 


Consider the following fictional scenario. A series of boxes are connected in a row, with each
adjacent box accessed via a sliding door, as in Figure 10.6. In the leftmost box there is a cat,
and in the rightmost box a mouse. Then, at discrete points in time, t = 0,1,2,..., the doors
connecting the boxes open, and both the cat and mouse migrate from their current positions to
directly adjacent boxes. They always move from their current box, randomly, with equal probability
of going either left or right one box at a time.
Figure 10.6: Illustration of the setup, consisting of adjacent boxes (Box 1 to Box 5), and the
starting positions of the cat (Box 1) and mouse (Box 5), for n = 5 boxes.


At t = 1, the cat and mouse must move to boxes 2 and 4 respectively. However at t = 2, the
cat may move to either 1 or 3, and the mouse to either 3 or 5. This process of opening and closing
the sliding doors repeats, until eventually the cat and mouse are in the same box, at which point
the mouse is eaten by the cat and the game ends.


This situation is different from the type of Markov chain described in the previous section and
from the weather chain described in Listing 1.8. In these earlier cases, the processes
are recurrent and go on forever. In the current case the process appears to be transient since at a
given (random) point of time, the mouse is eaten. For recurrent Markov chains typical questions
often deal with the steady state stationary distribution. However in a situation such as the one we
describe here, a typical question may be: how long until the mouse is eaten? As this is a random
variable, we may be interested in its distribution, or at least its expected value.


When modeling such a scenario using a Markov chain there are many options because we have
freedom as to how to describe the states. For example, one way is to describe the states as tuples
(x,y) where x is the location of the cat and y is the location of the mouse. However, we don't have
to consider all possible combinations of x and y because it always holds that $x \le y$. We may also
observe that at any given time, both the mouse and the cat are either both in odd locations or both
in even locations. This is because they are forced to move at each step, and the process alternates
between odd and even. Such periodic phenomena can be studied further in Markov chains, however
for our purposes we use this knowledge to set a small state space as follows:


State 1: (1,5). The game starts in this state. The game continues. 
State 2: (2,4). The game continues. 
State 3: (1,3). The game continues. 
State 4: (3,5). The game continues. 


State 5: (2,2), (3,3), and (4,4). The game ends. 


With the states defined, we set the state space to consist of states {1, 2, 3, 4, 5} where each state
describes a situation as depicted above. From this, the stochastic matrix P is then constructed as
follows:

$$
P = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 \\
1/4 & 0 & 1/4 & 1/4 & 1/4 \\
0 & 1/2 & 0 & 0 & 1/2 \\
0 & 1/2 & 0 & 0 & 1/2 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}.
\tag{10.15}
$$
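The entries of this matrix can be derived mechanically by enumerating the joint moves of the cat and the mouse from each of the four live states. The following sketch does so under the movement rules described above; the helper moves() and the variable names are illustrative.

```julia
# Possible moves from a box: the end boxes force an inward move, otherwise the
# animal goes left or right with equal probability.
moves(pos, n) = pos == 1 ? [2] : pos == n ? [n-1] : [pos-1, pos+1]

n = 5
live = [(1,5), (2,4), (1,3), (3,5)]      # states 1 to 4 as (cat, mouse) positions
P = zeros(5, 5)
for (s, (catPos, mousePos)) in enumerate(live)
    catMoves, mouseMoves = moves(catPos, n), moves(mousePos, n)
    q = 1/(length(catMoves)*length(mouseMoves))     # each joint move is equally likely
    for c in catMoves, m in mouseMoves
        j = c == m ? 5 : findfirst(==((c, m)), live)    # same box means state 5
        P[s, j] += q
    end
end
P[5, 5] = 1.0                            # state 5 is absorbing: the game has ended

display(P)
```

For example, from state 2, which is (2,4), the four equally likely joint moves lead to (1,3), (1,5), (3,5) and (3,3), giving the second row of the matrix.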


With such a representation of this Markov chain, we are now interested in the hitting time of
state 5. That is, the time until state 5 is reached, denoted via $\tau = \inf\{t : X(t) = 5\}$. It turns out
that the theory of Markov chains goes a long way in computing expressions such as $E[\tau]$. One way
this can be done is by considering














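$$
p_0 = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix},
\qquad \text{and} \qquad
T = \begin{bmatrix}
0 & 1 & 0 & 0 \\
1/4 & 0 & 1/4 & 1/4 \\
0 & 1/2 & 0 & 0 \\
0 & 1/2 & 0 & 0
\end{bmatrix}.
$$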


Here $p_0$ is an initial distribution vector over the states {1,2,3,4} and T is the part of the transition
probability matrix P that relates to states {1,2,3,4}. It can be shown using probabilistic arguments
that,











$$E[\tau] = p_0 \,(I + T + T^2 + \ldots)\, \mathbf{1},$$

where $\mathbf{1}$ is a vector of 1’s. This is done by considering all possible paths that avoid the
absorbing state 5. Here, for each k = 0,1,2,..., the term $p_0 T^k \mathbf{1}$ is the probability that
state 5 has not yet been reached after k steps, that is $\mathbb{P}(\tau > k)$, and summing these tail
probabilities over k yields $E[\tau]$. Now by the theory of non-negative matrices it holds
that





$$I + T + T^2 + \ldots = (I - T)^{-1}, \tag{10.16}$$

and the inverse exists (T is a sub-stochastic matrix with maximal eigenvalue strictly inside the unit
circle). This can now be computed to find the analytic solution,











$$E[\tau] = p_0 \,(I + T + T^2 + \cdots)\, \mathbf{1} = p_0 \,(I - T)^{-1}\, \mathbf{1} = 4.5. \tag{10.17}$$





Hence the mean time until the cat catches the mouse is 4.5. Listing 10.5 illustrates this computation,
as well as the validity of the infinite matrix geometric series, (10.16), sometimes called a Leontief
series. It also shows that the maximal eigenvalue of T is strictly inside the unit circle.


Listing 10.5: Calculation of a matrix infinite geometric series


using LinearAlgebra
P = [ 0   1   0   0   0;
      1/4 0   1/4 1/4 1/4;
      0   1/2 0   0   1/2;
      0   1/2 0   0   1/2;
      0   0   0   0   1]
T = P[1:4,1:4]
p0 = [1 0 0 0]
for n in 1:10
    println(first(p0*sum([T^k for k in 0:n])*ones(4)))
end
println("Using inverse: ", first(p0*inv(I-T)*ones(4)))
println("Eigenvalues of T: ", sort(eigvals(T)))








2.0
2.75
3.25
3.625
3.875
4.0625
4.1875
4.28125
4.34375
4.390625
Using inverse: 4.5
Eigenvalues of T: [-0.7071067811865, 0.0, 2.862293735361e-17, 0.7071067811865]








In line 7 we construct the matrix T as the sub-matrix of the matrix P. In lines 9-11 we consider the
LHS series in (10.16) for increasing values of n. In line 12 the RHS of (10.16) is calculated. Note the
use of the inv() function to calculate the inverse of I-T. Line 13 prints the sorted eigenvalues, and
shows that the largest eigenvalue has magnitude less than 1 and hence all eigenvalues lie strictly
inside the unit circle.
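As a complement, the same quantity can be obtained for every starting state at once by solving a linear system instead of forming an inverse or a truncated series. The sketch below assumes the sub-matrix T of the cat and mouse example, re-entered here so that the block is self-contained; the vector h is notation introduced for this illustration.

```julia
using LinearAlgebra

T = [0   1   0   0;
     1/4 0   1/4 1/4;
     0   1/2 0   0;
     0   1/2 0   0]

# h[i] is the mean number of steps until state 5 is hit when starting in state i.
# Conditioning on the first step gives h = 1 + T*h, that is (I - T)*h = 1.
h = (I - T) \ ones(4)

println("Mean hitting times from states 1 to 4: ", h)
```

The first entry agrees with (10.17), and the remaining entries show that the mouse is expected to survive for a shorter time from the other live states.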





Continuing with this cat and mouse example, in Listing 10.6 we arrive at the same result via
alternative methods. One method is via a first principles implementation of the scenario, which is
done in function cmHitTime(). The two other alternative methods make use of the mcTraj()
function which we implement. This is a much more generic function, which creates a trajectory of a
Markov chain with an arbitrary transition probability matrix P, given a starting state initState.
It runs either for a duration of T, or stops when hitting state stopState. Note that by default
stopState = 0, indicating the simulation only stops after T steps.


For illustration we use mcTraj() in two alternative ways. One way is by invoking it many
times over (N) as follows: mcTraj(P,1,10^6,5), where P is the transition probability matrix in
(10.15), the second and fourth arguments are the initial and stopping states respectively, and the
third argument, 10^6, is intended to be a high enough T such that the simulation only stops due
to hitting state 5. Then averaging the lengths of all N trajectories yields an estimate of $E[\tau]$.














The second way in which we use mcTraj() is related to the concept of regenerative simulation.
We modify the final row of the transition probability matrix by setting $P_{5,1} = 1$ and $P_{5,5} = 0$.
This implies that once state 5 is reached, instead of the process being absorbed in that state, it
regenerates and starts afresh in state 1. In the language of Markov chains, this makes the transition
probability matrix irreducible and hence (as it is a finite state space) positive recurrent. This then
means that it possesses a stationary distribution (or limiting distribution). It then holds that the
inverse of the limiting probability of state 5 is the mean number of steps that are required to revisit
the state. Each such cycle consists of one step from state 5 back to state 1 followed by the hitting
time of state 5, so subtracting 1 from the mean cycle length yields $E[\tau]$. This allows us to generate
one long trajectory of this Markov chain, estimate the limiting probability in state 5, and then obtain
an estimate for $E[\tau]$.
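The regenerative argument can also be checked without simulation by computing the stationary distribution of the modified chain exactly. The sketch below solves the stationarity equations as a least-squares system using only LinearAlgebra; the variable names (A, b, piVec) are illustrative.

```julia
using LinearAlgebra

# Regenerative version of (10.15): from state 5 the chain restarts in state 1.
P = [0   1   0   0   0;
     1/4 0   1/4 1/4 1/4;
     0   1/2 0   0   1/2;
     0   1/2 0   0   1/2;
     1   0   0   0   0]

# Stationary distribution: solve pi*P = pi together with sum(pi) = 1.
n = size(P, 1)
A = [(P' - I); ones(1, n)]       # stacked system, (n+1) x n
b = [zeros(n); 1.0]
piVec = A \ b                    # least-squares solve of the consistent system

println("Stationary probability of state 5: ", piVec[5])
println("Implied mean hitting time: ", 1/piVec[5] - 1)
```

The stationary probability of state 5 is 2/11, so the mean cycle length is 5.5 steps and the implied mean hitting time is 4.5, in agreement with (10.17).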
Listing 10.6: Markovian cat and mouse survival


using Statistics, StatsBase, Random, LinearAlgebra
Random.seed!(1)

function cmHitTime()
    catIndex, mouseIndex, t = 1, 5, 0
    while catIndex != mouseIndex
        catIndex += catIndex == 1 ? 1 : rand([-1,1])
        mouseIndex += mouseIndex == 5 ? -1 : rand([-1,1])
        t += 1
    end
    return t
end
function mcTraj(P,initState,T,stopState=0)
    n = size(P)[1]
    state = initState
    traj = [state]
    for t in 1:T-1
        state = sample(1:n,weights(P[state,:]))
        push!(traj,state)
        if state == stopState
            break
        end
    end
    return traj
end
N = 10^6
P = [ 0   1   0   0   0;
      1/4 0   1/4 1/4 1/4;
      0   1/2 0   0   1/2;
      0   1/2 0   0   1/2;
      0   0   0   0   1]

theor = [1 0 0 0]*(inv(I - P[1:4,1:4])*ones(4))
est1 = mean([cmHitTime() for _ in 1:N])
est2 = mean([length(mcTraj(P,1,10^6,5))-1 for _ in 1:N])

P[5,:] = [1 0 0 0 0]
pi5 = sum(mcTraj(P,1,N) .== 5)/N
est3 = 1/pi5 - 1

println("Theoretical: ", theor)
println("Estimate 1: ", est1)
println("Estimate 2: ", est2)
println("Estimate 3: ", est3)








Theoretical: 4.5 

Estimate 1: 4.497357 

Estimate 2: 4.501016 

Estimate 3: 4.507305440667045 
In lines 4-12 we define the function cmHitTime() which returns a random time until the cat catches
the mouse. The initial positions of the cat and mouse (catIndex and mouseIndex respectively)
are set in line 5. The while loop in lines 6-10 then updates these position indexes until catIndex
and mouseIndex are the same. Note that in line 7, if the cat is in position/box 1, then it moves to
box 2 with certainty (+1), else its position index is uniformly and randomly incremented either up or
down by 1. A similar approach is used for the index/position of the mouse in line 8. In lines 13-25 we
define the function mcTraj(). As opposed to cmHitTime(), this function generates a trajectory of
a general finite state discrete time Markov chain. The argument matrix P is the transition probability
matrix; the argument initState is an initial starting state; the argument T is a maximal duration
of a simulation; and the argument stopState is an index of a state to stop on if reached before T.
The default value of 0 specified indicates that there is no stop state because the state space is taken
to be 1,...,n (the dimension of P). The logic of the simulation is similar to the simulation in Listing 1.8.
The key is line 18 where the sample function samples the next state from 1:n based on probabilities
determined by the respective row of the matrix P. Note that the iteration over the time horizon
can stop early if the stopState is reached and the break statement of line 21 is executed. In lines 27-31
we define the transition probability matrix P as in (10.15). In line 33 we calculate the analytic solution
to the average life expectancy of the mouse according to (10.17). In line 34 we use the cmHitTime()
function to generate N i.i.d. random variables and compute their mean as est1. In line 35 we use
the mcTraj() function setting a time horizon of 10^6 (effectively unbounded for this example) and
a stopState of 5. We then generate trajectories and subtract 1 from their length to get a hitting
time. Averaging this over N trajectories creates est2. Lines 37-39 create the third estimate, est3,
via regenerative simulation as described above. Here we estimate the long term proportion of being
in state 5 in line 38.





Continuous Time Markov Chains 


A continuous time Markov chain, also known as a Markov jump process, is a stochastic process,
X(t), with a discrete state space operating in continuous time t, satisfying the property,

$$\mathbb{P}\big(X(t+s) = j \mid X(t) = i \text{ and information about } X(u) \text{ for } u < t\big) = \mathbb{P}\big(X(t+s) = j \mid X(t) = i\big). \tag{10.18}$$

That is, only the most recent information (at time t) affects the distribution of the process at a
future time (t + s). Other definitions can also be stated, however (10.18) captures the essence of
the Markov property, similar to (10.14) for discrete time Markov chains. An extensive account of
continuous time Markov chains can be found in [N97].


While there are different ways to parameterize continuous time Markov chain models, a very 
common way is by using a so-called generator matrix. Such a square matrix, with dimension 
matching the number of states, has non-negative elements on the off-diagonal and non-positive 
diagonal values where each entry in the diagonal is the negative of the sum of the other entries on 
the same row. This ensures that the sum of each row is 0. For example, for a chain with three 
states, a generator matrix may be: 


$$
Q = \begin{bmatrix}
-3 & 1 & 2 \\
1 & -2 & 1 \\
0 & 1.5 & -1.5
\end{bmatrix}.
\tag{10.19}
$$


The values $Q_{ij}$ for $i \neq j$ indicate the intensity of transitioning from state i to state j. In this
example, since $Q_{12} = 1$ and $Q_{13} = 2$, there is an intensity of 1 for transitions from state 1 to state 2,
and an intensity of 2 for transitions from state 1 to state 3. This implies that when X(t) = 1, during
the time interval $[t, t + \Delta]$, for small $\Delta$, there is a chance of approximately $1 \times \Delta$ for transitioning to
state 2 and a chance of approximately $2 \times \Delta$ for transitioning to state 3. Furthermore there is a
(significant) chance of approximately $1 - 3 \times \Delta$ for not making a transition at all.


An attribute of continuous time Markov chains is that when X(t) = i, the time
until a state transition occurs is exponentially distributed with parameter $-Q_{ii}$. In the case of the
example above, when X(t) = 1 the mean duration until a state change is 1/3. Furthermore, upon a
state transition, the transition is to state j with probability $-Q_{ij}/Q_{ii}$. In addition, the target state
j is independent of the duration spent in state i. These properties are central to continuous time
Markov chains. See for more details.
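These quantities can be read off a generator matrix with a few lines of code. The sketch below uses the three-state generator matrix of the example, entered here as we read it from (10.19); the names rates, meanHoldings and jumpP are illustrative.

```julia
using LinearAlgebra

Q = [-3.0  1.0  2.0;
      1.0 -2.0  1.0;
      0.0  1.5 -1.5]

rates        = -diag(Q)        # exponential holding-time rates, -Q[i,i]
meanHoldings = 1 ./ rates      # mean duration spent in each state per visit

# Upon leaving state i, the next state is j != i with probability -Q[i,j]/Q[i,i].
jumpP = Q ./ rates             # divides row i by rates[i]
jumpP[diagind(jumpP)] .= 0.0   # no transition from a state to itself

println("Mean holding times: ", meanHoldings)
display(jumpP)
```

For state 1 this gives a mean holding time of 1/3 and jump probabilities of 1/3 and 2/3 to states 2 and 3 respectively, in line with the intensities 1 and 2.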


We can also associate some discrete time Markov chains with the continuous time models. One
way to do this is to fix some time step $\Delta$ (not necessarily small), and define for $t = 0,1,2,3,\ldots$,

$$\tilde{X}(t) = X(t\Delta).$$

The discrete time process, $\tilde{X}(\cdot)$, is sometimes called the skeleton at time steps of $\Delta$ of the continuous
time process $X(\cdot)$. It turns out that for continuous time Markov chains,

$$P\big(X(t) = j \mid X(0) = i\big) = \big[e^{Qt}\big]_{ij},$$

i.e. it is given by the $i,j$'th entry of the matrix exponential. Hence the transition probability matrix
of the discrete time Markov chain $\tilde{X}(t)$ is the matrix exponential $e^{Q\Delta}$. This hints at one way of
approximately simulating a continuous time Markov chain: set $\Delta$ small and simulate a discrete time
Markov chain with transition probability matrix $e^{Q\Delta}$. Note also that if $\Delta$ is small then,

$$e^{Q\Delta} \approx I + \Delta Q. \tag{10.20}$$
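To get a feeling for the quality of the approximation (10.20), one can compare the matrix exponential with $I + \Delta Q$ for a few step sizes. The sketch below does this for the generator matrix (10.19); the helper name `approxError` and the chosen step sizes are illustrative assumptions.

```julia
using LinearAlgebra

# Generator matrix (10.19)
Q = [-3.0  1.0  2.0;
      1.0 -2.0  1.0;
      0.0  1.5 -1.5]

# Largest entrywise gap between exp(Q*delta) and its first order approximation
approxError(delta) = maximum(abs.(exp(Q*delta) - (I + delta*Q)))

for delta in [0.1, 0.01, 0.001]
    println("delta = ", delta, "\t max abs error = ", approxError(delta))
end
```

The error shrinks roughly quadratically in the step size, which is what one expects from dropping the second order term of the exponential series.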


However, a much better algorithm exists. For this, consider another discrete time Markov chain
associated with a continuous time Markov chain: the embedded Markov chain or jump chain. This
is a process that samples the continuous time Markov chain only at jump times. It has a transition
probability matrix $P$, with $P_{ii} = 0$ (as there isn't a transition from a state to itself), and for
$i \neq j$, $P_{ij} = -Q_{ij}/Q_{ii}$. The well known Gillespie algorithm, which we call here the Doob-Gillespie
algorithm, simulates a discrete time jump chain and stretches the intervals between the jumps by
exponential random variables to yield a trajectory of the continuous time Markov chain. At each
iteration of the algorithm, if we are in state $i$, we increment time by an exponential random variable
with rate $-Q_{ii}$ and choose the next state based on $P_{ij}$.


In Listing 10.7 we consider a continuous time Markov chain with three states, starting with
initial probability distribution [0.4 0.5 0.1] and with generator matrix (10.19). The code
determines the probability distribution of the state at time $T = 0.25$, showing that it is
approximately [0.27 0.43 0.3]. This is achieved in three different ways. The first method is via the
crudeSimulation() function, which is an inefficient simulation of a discrete time Markov chain
skeleton with transition probability matrix $P = I + \Delta Q$, where $\Delta$ is taken as a small scalar value.
The second method is via the doobGillespie() function, which is an implementation of the
Doob-Gillespie algorithm presented above. Finally, the matrix exponential exp() is used as a
non-Monte Carlo evaluation.
Listing 10.7: Simulation and analysis using a generator matrix

 1  using StatsBase, Distributions, Random, LinearAlgebra
 2  Random.seed!(1)
 3
 4  function crudeSimulation(deltaT,T,Q,initProb)
 5      n = size(Q)[1]
 6      Pdelta = I + Q*deltaT                  # approximation (10.20)
 7      state = sample(1:n,weights(initProb))
 8      t = 0.0
 9      while t < T
10          t += deltaT
11          state = sample(1:n,weights(Pdelta[state,:]))
12      end
13      return state
14  end
15
16  function doobGillespie(T,Q,initProb)
17      n = size(Q)[1]
18      Pjump = (Q-diagm(0 => diag(Q)))./-diag(Q)   # embedded (jump) chain
19      lamVec = -diag(Q)                            # exit rates
20      state = sample(1:n,weights(initProb))
21      sojournTime = rand(Exponential(1/lamVec[state]))
22      t = 0.0
23      while t + sojournTime < T
24          t += sojournTime
25          state = sample(1:n,weights(Pjump[state,:]))
26          sojournTime = rand(Exponential(1/lamVec[state]))
27      end
28      return state
29  end
30
31  T, N = 0.25, 10^5
32
33  Q = [-3 1 2
34        1 -2 1
35        0 1.5 -1.5]
36
37  p0 = [0.4 0.5 0.1]
38
39  crudeSimEst = counts([crudeSimulation(10^-3., T, Q, p0) for _ in 1:N])/N
40  doobGillespieEst = counts([doobGillespie(T, Q, p0) for _ in 1:N])/N
41  explicitEst = p0*exp(Q*T)
42
43  println("CrudeSim: \t\t", crudeSimEst)
44  println("Doob Gillespie Sim: \t", doobGillespieEst)
45  println("Explicit: \t\t", explicitEst)

CrudeSim: [0.26845, 0.43054, 0.30101]
Doob Gillespie Sim: [0.26709, 0.43268, 0.30023]
Explicit: [0.269073 0.431815 0.299112]
In lines 4-14 we define the crudeSimulation() function, which approximately simulates a
continuous time Markov chain through the implementation of (10.20). Observe that in line 10, time is
incremented by the discrete (small) interval deltaT. In lines 16-29 we define the doobGillespie()
function, which simulates the state at time T by generating exponentially spaced
discrete jumps according to the logic described above. Key here is that in every iteration there are
two random number generations. In line 25, the next state is generated according to the embedded
Markov chain, Pjump. In line 26 an exponential random variable is generated. In line 31 we set the
time horizon T and the number of repetitions N. In lines 33-35 we set the generator matrix, Q. In line 37
we set the initial probability vector, p0. In lines 39-41 we evaluate the probability distribution of the
state at time T in three alternative ways, yielding the results in crudeSimEst, doobGillespieEst
and explicitEst. These are printed in lines 43-45.














A Simple Markovian Queue 


We now briefly explore queueing theory, which is the mathematical study of queues and
congestion. See for example for an elegant introduction to the field. This field of stochastic
operations research and applied probability is full of mathematical models for queues, waiting
times and congestion. One of the most basic models in the field is called the M/M/1 queue. In
this model a single server (this is the “1” in the model name) serves customers from a queue, where
customers arrive according to a Poisson process and each one has an independent exponential
service time. The “M”s in the model name indicate Poisson arrivals and exponential service times,
where ‘M’ stands for ‘Markovian’, or ‘memoryless’.


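The number of customers in the system can be represented by X(t), a continuous time Markov
chain taking on values in the state space $\{0, 1, 2, \ldots\}$. In this case the (infinite) tridiagonal generator
matrix is given by:

$$Q = \begin{bmatrix} -\lambda & \lambda & & & \\ \mu & -(\lambda+\mu) & \lambda & & \\ & \mu & -(\lambda+\mu) & \lambda & \\ & & \mu & -(\lambda+\mu) & \ddots \\ & & & \ddots & \ddots \end{bmatrix}. \tag{10.21}$$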


Here $\lambda$ indicates the rate of arrival, changing X(t) from state $i$ to state $i + 1$, and $\mu$ indicates the
rate of service, changing X(t) from state $i$ to state $i - 1$. A common important parameter is called
the offered load,

$$\rho = \frac{\lambda}{\mu}.$$

When $\rho < 1$ the process X(t) is stochastically stable, in which case there is a stationary distribution
for the continuous time Markov chain with,

$$\lim_{t \to \infty} P\big(X(t) = k\big) = (1-\rho)\rho^k, \qquad k = 0,1,2,\ldots. \tag{10.22}$$

As this is simply the geometric distribution (see Section 3.5), it isn't hard to see that the steady
state mean (which we denote by $L_{M/M/1}$) is,

$$L_{M/M/1} = \frac{\rho}{1-\rho}. \tag{10.23}$$
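The step from the geometric distribution to the mean can be checked numerically by truncating the infinite sum. The following sketch assumes $\rho = 0.7$ and a truncation point `kMax = 1000`; both values are illustrative choices.

```julia
rho = 0.7                                    # assumed offered load, below 1
kMax = 1000                                  # truncation point for the infinite sum

piVec = [(1 - rho)*rho^k for k in 0:kMax]    # stationary probabilities (10.22)
meanL = sum(k*piVec[k+1] for k in 0:kMax)    # numerical steady state mean

println("Total probability: ", sum(piVec))
println("Numerical mean: ", meanL)
println("Formula rho/(1-rho): ", rho/(1 - rho))
```

The truncated sum agrees with $\rho/(1-\rho)$ up to numerical precision because the neglected tail is of order $\rho^{1001}$.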
In Listing 10.8 we implement a Doob-Gillespie simulation of the M/M/1 queue. First we plot a
trajectory of the queue length process X(t) over $t \in [0,200]$ in Figure 10.7. Then we simulate the
queue for a long time horizon and check that the empirically observed mean queue length agrees
with the analytic solution from (10.23).


Listing 10.8: M/M/1 queue simulation

using Distributions, Random, Plots; pyplot()
Random.seed!(4)

function simulateMM1DoobGillespie(lambda,mu,Q0,T)
    t, Q = 0.0 , Q0
    tValues, qValues = [0.0], [Q0]
    while t<T
        if Q == 0
            t += rand(Exponential(1/lambda))            # only arrivals are possible
            Q = 1
        else
            t += rand(Exponential(1/(lambda+mu)))       # sojourn at rate lambda+mu
            Q += 2(rand() < lambda/(lambda+mu)) -1      # +1 for arrival, -1 for departure
        end
        push!(tValues,t)
        push!(qValues,Q)
    end
    return [tValues, qValues]
end

function stichSteps(epochs,q)
    n = length(epochs)
    newEpochs = [ epochs[1] ]
    newQ = [ q[1] ]
    for i in 2:n
        push!(newEpochs,epochs[i])
        push!(newQ,q[i-1])
        push!(newEpochs,epochs[i])
        push!(newQ,q[i])
    end
    return [newEpochs, newQ]
end

lambda, mu = 0.7, 1.0
Tplot, Testimation = 200, 10^7
Q0 = 20

eL, qL = simulateMM1DoobGillespie(lambda, mu ,Q0, Testimation)
meanQueueLength = (eL[2:end]-eL[1:end-1])'*qL[1:end-1]/last(eL)
rho = lambda/mu
println("Estimated mean queue length: ", meanQueueLength )
println("Theoretical mean queue length: ", rho/(1-rho) )

epochs, qValues = simulateMM1DoobGillespie(lambda, mu, Q0,Tplot)
epochsForPlot, qForPlot = stichSteps(epochs, qValues)
plot(epochsForPlot,qForPlot,
    c=:blue, xlims=(0,Tplot), ylims=(0,25), xlabel="Time",
    ylabel="Customers in queue", legend=:none)

Estimated mean queue length: 2.33569071839852
Theoretical mean queue length: 2.333333333333333
Figure 10.7: A queue length process of the M/M/1 queue starting with 20
customers in the system and with $\rho = 0.7$.





In lines 4-19 we implement the simulateMM1DoobGillespie() function. This function uses
the Doob-Gillespie algorithm to create a trajectory of the M/M/1 queue. In contrast to the
doobGillespie() function defined in Listing 10.7, our current function records the whole
trajectory of the continuous time Markov chain. That is, the return value consists of tValues indicating
times and qValues indicating state values (the state is held constant between times). Observe that
in line 9 of the function implementation, a state sojourn time of rate $\lambda$ is used as it matches state 0.
Then in line 12, the state sojourn time has rate $\lambda + \mu$ and in line 13 there is a state transition either
up or down, independently of the state sojourn time. In lines 21-32 we define the stichSteps()
function, which creates a trajectory that can be plotted based on an array of time epochs, epochs,
and an array of queue lengths at each epoch, q. The parameters of the queue and of the simulation
are set in lines 34-36. Note that two separate times are set. The first, Tplot = 200, is used to
plot a trajectory starting with Q0 = 20 customers in the system. The second much longer duration,
Testimation, is used for a simulation run that estimates the mean queue length. In lines 38-42 we
handle the long time horizon simulation to print the estimate of the mean queue length compared
to the theoretical value from (10.23). Importantly, in line 39 the difference sequence of time jumps
is calculated via eL[2:end]-eL[1:end-1]. By taking the inner product of this vector with the
queue lengths we are able to integrate over the queue length from time 0 until the last time, last(eL), and
obtain the average queue length. In lines 44-48 we run a simulation for the short time horizon, apply
stichSteps() to it, and plot the trajectory in Figure 10.7.
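The time average computed via the inner product can be checked by hand on a tiny trajectory. The numbers below are made up for illustration and do not come from the simulation.

```julia
# A hand-made piecewise constant trajectory: the state equals qVals[i] on [epochs[i], epochs[i+1])
epochs = [0.0, 1.0, 3.0, 6.0]
qVals  = [2, 3, 1, 0]                               # the last value is the state reached at the final epoch

durations = epochs[2:end] - epochs[1:end-1]         # time spent in each of the first three states
timeAvg = durations' * qVals[1:end-1] / last(epochs)

println(durations)
println(timeAvg)                                    # equals (1*2 + 2*3 + 3*1)/6
```

The last queue value is excluded from the inner product because no time has yet been spent in it when the trajectory ends.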








A Stochastic SEIR Model 


We now return to the epidemic scenario modeled in Listing 10.3. There is a stochastic continuous
time Markov chain version of this epidemic model which in many ways is more natural than its
deterministic counterpart and can provide more information than a deterministic model. The basic
idea is to define a continuous time Markov chain where the state $X(t) = (S(t), E(t), I(t), R(t))$ takes
values in the discrete set $\{0,1,2,\ldots,M\}^4$, where $M$ is the number of individuals in the epidemic and
the components of X(t) have the same interpretation as in the deterministic model of Listing 10.3.

Figure 10.8: A single trajectory of a stochastic SEIR model. The “Final
Infected” points result from 30 other independent trajectories.

Now transitions between states follow intensities parameterized by $\beta$, $\gamma$, and $\delta$, similarly to the
deterministic model of Listing 10.3. That is, given that the state is $(s,e,i,r) \in \{0,1,2,\ldots,M\}^4$
at a given time, the following transitions can occur:

$$\begin{aligned}
(s,e,i,r) &\to (s-1,\,e+1,\,i,\,r) && \text{at rate } \frac{\beta}{M} \times s \times i, \\
(s,e,i,r) &\to (s,\,e-1,\,i+1,\,r) && \text{at rate } M \times \delta \times e, \\
(s,e,i,r) &\to (s,\,e,\,i-1,\,r+1) && \text{at rate } \gamma \times i.
\end{aligned}$$
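Given the three rates, a Doob-Gillespie step must decide which transition occurs, with probability proportional to each rate. One common way to do this is to compare a single uniform random number against cumulative probabilities. The helper `chooseEvent` below is a hypothetical name introduced for this sketch, and the rates are illustrative values rather than values taken from the model.

```julia
# Hypothetical helper: return index k with probability probs[k], given u ~ Uniform(0,1)
function chooseEvent(probs, u)
    k = findfirst(c -> u < c, cumsum(probs))
    return something(k, length(probs))   # guard against rounding in the last cumulative sum
end

# Illustrative rates only: infection, symptom onset, removal
rates = [2.0, 5.0, 3.0]
probs = rates/sum(rates)                 # event probabilities proportional to the rates

println(chooseEvent(probs, 0.1))         # u falls in the first interval
println(chooseEvent(probs, 0.6))         # u falls in the second interval
println(chooseEvent(probs, 0.9))         # u falls in the third interval
```

Note that the second comparison is against the cumulative value `probs[1] + probs[2]` and not against `probs[2]` alone. Comparing against `probs[2]` on its own would select the second event with the wrong probability.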


This type of model, as well as similar stochastic epidemic models, is analyzed in [DG01]. See also
the documentation for DifferentialEquations.jl, where methods and specialized code for
defining and simulating chemical reaction models are introduced.





In Listing 10.9 we execute a Doob-Gillespie simulation of the stochastic SEIR model. The nature
of the simulation code is similar to the M/M/1 simulation in Listing 10.8 even though the underlying
system and model are very different. The listing generates a single trajectory of S(t), E(t), I(t), R(t)
plotted in Figure 10.8. It also generates 30 additional trajectories for which we only plot the end
value of the number of removed. This allows us to assess the variability of the results
predicted by the model. Compare the stochastic Figure 10.8 with the figure of the deterministic
model of Listing 10.3.
Listing 10.9: Stochastic SEIR epidemic simulation

using Distributions, Random, Plots; pyplot()
Random.seed!(0)

beta, delta, gamma = 0.25, 0.2, 0.1
initialInfect, M = 0.025, 1000
I0 = Int(floor(initialInfect*M))

N = 30
function simulateSIRDoobGillespie(beta,delta,gamma,I0,M,T)
    t, S, E, I, R = 0.0, M-I0, 0, I0, 0
    tValues, sValues, eValues, iValues, rValues = [0.0], [S], [E], [I], [R]
    while t<T
        infectionRate = beta*I*S
        symptomRate = delta*E
        removalRate = gamma*I
        totalRate = infectionRate + symptomRate + removalRate
        probs = [infectionRate, symptomRate, removalRate]/totalRate
        t += rand(Exponential(1/(totalRate)))
        u = rand()
        if u < probs[1]
            S -= 1; E += 1
        elseif u < probs[1] + probs[2]     # cumulative probability of the first two events
            E -= 1; I += 1
        else
            I -= 1; R += 1
        end
        push!(tValues,t)
        push!(sValues,S);push!(eValues,E);push!(iValues,I);push!(rValues,R)
        I == 0 && break
    end
    return [tValues, sValues, eValues, iValues, rValues]
end

tV,sV,eV,iV,rV = simulateSIRDoobGillespie(beta/M,M*delta,gamma,I0,M,Inf)
finals = [simulateSIRDoobGillespie(beta/M,M*delta,gamma,I0,M,Inf)[5][end]
            for _ in 1:N]/M

plot(tV,sV/M, label = "Susceptible", c=:green)
plot!(tV,eV/M, label = "Exposed", c=:blue)
plot!(tV,iV/M, label = "Infected", c=:red)
plot!(tV,rV/M, label = "Removed", c=:yellow,
    xlabel = "Time", ylabel = "Proportion",
    legend = :topleft, xlim = (0,tV[end]*1.05))
scatter!(tV[end]*1.025*ones(N),finals, c = :yellow,label= "Final Infected")


























The model parameters are set in line 4 and the initial infected proportion, population size, and initial
number of infected are set in lines 5-6. The number of replicates for observing the end behavior is set
in line 8. The function simulateSIRDoobGillespie() in lines 9-32 simulates the epidemic, where
the rates of the three possible transitions are set in lines 13-15 and the corresponding transition
probabilities, probs, are computed in line 17. The code in line 29 executes the break statement only if I==0. This use of
short circuit evaluation is a common Julia idiom. In line 34 we run a single trajectory which is later
plotted. Then in lines 35-36 we run N trajectories only for the purpose of evaluating the final size of
the epidemic. Both the single run and a scatter plot of the finals array are plotted.
10.3 Discrete Event Simulation


We now introduce the concept of discrete event simulation. This is a way of simulating dynamic
systems that are subject to changes occurring over discrete points of time. The basic idea is to
consider discrete time instances, $T_1 < T_2 < T_3 < \ldots$, and assume that in between $T_i$ and $T_{i+1}$ the
system state model X(t) remains unchanged, or follows a deterministic path. At each discrete time
point $T_i$ the system state is modified due to an event that causes such a state change. This type of
simulation is often suitable for models occurring in logistics, social service, and communication.


As an illustrative hypothetical example, consider a health clinic with a waiting room for patients.
Assume that two doctors are operating in their own rooms and there is a secretary who administers
patients. The state of the system can be represented by the combination of: the number of patients
in the waiting room; the number of patients (say 0 or 1) speaking with the secretary; the number of
patients engaged with the doctors; the activity of the doctors (say administering aid to patients,
on a break, or not engaged); and the activity of the secretary (say engaged with a patient, speaking
on the phone, on a break, or not engaged).


Some of the events that may take place in such a clinic include: a new patient arrives at the
clinic; a patient enters a doctors’ room; a patient leaves the doctors’ room and goes to speak with
the secretary; the secretary answers a phone call; the secretary completes a phone call, etc. The
occurrence of each event causes a state change, and these events appear over discrete time points
$T_1 < T_2 < T_3 < \ldots$. Hence to simulate such a health clinic, we advance simulation time, $t$, over
discrete time points.


The question then is, at which time points do events occur? The answer depends on the
simulation scenario since the time of future events depends on previous events that have occurred. For
example, consider the event “a patient leaves the doctors’ room and goes to speak with the
secretary”. This type of event will occur after the patient entered the doctors’ room, and is implemented
by scheduling the event just as the patient enters the doctors’ room. That is, in a discrete event
simulation, there is typically an event schedule that keeps track of all future events. Then, the
simulation algorithm advances time from $T_i$ to $T_{i+1}$, where $T_{i+1}$ is the time corresponding to the
next event in the schedule. Hence a discrete event simulation maintains some data structure for the
event schedule that is dynamically updated as simulation time progresses. General commercial
simulation software such as AnyLogic, Arena and GoldSim do this in a generic manner. However,
in the examples that we present below, the event schedule is implemented in a way that is suited
for our example simulation problem. For other applications, one can also look at the SimJulia
package, which is for process oriented simulation in Julia, and is briefly mentioned in Appendix C.
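To make the idea of an event schedule concrete, here is a minimal sketch that keeps future events in a time-sorted vector and repeatedly advances to the earliest one. The function names `scheduleEvent!` and `runSchedule!` and the event descriptions are hypothetical and only serve the illustration. This is not how the listings below implement their schedules.

```julia
# A minimal event schedule kept as a time-sorted vector of (time, description) pairs
schedule = Tuple{Float64,String}[]

function scheduleEvent!(sched, time, name)
    push!(sched, (time, name))
    sort!(sched, by = first)           # keep the earliest event at the front
    return sched
end

function runSchedule!(sched)
    history = Tuple{Float64,String}[]
    while !isempty(sched)
        t, name = popfirst!(sched)     # advance simulation time to the next event
        push!(history, (t, name))
        println("t = ", t, ": ", name)
    end
    return history
end

scheduleEvent!(schedule, 5.0, "patient leaves the doctors' room")
scheduleEvent!(schedule, 1.5, "new patient arrives")
scheduleEvent!(schedule, 3.2, "secretary completes a phone call")

history = runSchedule!(schedule)
```

In a full simulation, handling an event would typically also modify the system state and schedule further events. For large schedules a priority queue is a more efficient choice than re-sorting a vector.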


We now return to the single server queue, similar to the M/M/1 queue that was simulated as 
a continuous time Markov chain in Section [10.2] with the Doob-Gillespie algorithm. In cases where 
inter-arrival or processing times in the queue are no longer exponentially distributed, modeling the 
system as a continuous time Markov chain is not easily possible (it is possible by means of extension 
of the state space, however this is not always the easiest implementation). Instead, simulating the 
system using discrete event simulation is straightforward. 


In the case of a single server queue there are two types of events: (i) arrival of a customer to the
system, and (ii) service completion of a customer. In this case, a discrete event simulation only
needs to maintain a schedule of when each of these events is to occur in the future. We now elaborate
on this via two simple variants of the M/M/1 queue.


M/M/1 vs. M/D/1 and M/M/1/K 


We now consider two variants of the M/M/1 queue model covered in Section 10.2, namely the M/D/1
and M/M/1/K models. In the M/D/1 model, the ‘D’ stands for deterministic service times. This is a
model where there is no variability of service durations, i.e. all customers require a service of duration
exactly $\mu^{-1}$. In a sense, such a model appears simpler than M/M/1, however mathematically it is
slightly more challenging for analysis. Nevertheless, in queueing theory it is a special case of the
M/G/1 queue, where ‘G’ stands for a general distribution of service time. For this, the
Khinchine-Pollaczek formula (see for example [HB13]) may be used to obtain the steady state mean number
of customers in a system, which exists when $\rho = \lambda/\mu < 1$.


The second M/M/1 variant that we consider, M/M/1/K, is actually mathematically simpler.
This model assumes that the system has finite capacity of size $K$. That is, at times when there are
$K - 1$ customers in the queue and one is being served (a total of $K$ in the system), any arriving
customers are lost and never return. From a mathematical perspective, this actually implies that
M/M/1/K systems are finite state continuous time Markov chains with generator matrix,

$$Q = \begin{bmatrix} -\lambda & \lambda & & & & \\ \mu & -(\lambda+\mu) & \lambda & & & \\ & \mu & -(\lambda+\mu) & \lambda & & \\ & & \ddots & \ddots & \ddots & \\ & & & \mu & -(\lambda+\mu) & \lambda \\ & & & & \mu & -\mu \end{bmatrix}. \tag{10.24}$$

Compare (10.24) with the (infinite size) matrix (10.21) of the standard M/M/1 queue. For any
$\rho = \lambda/\mu \neq 1$ this generator matrix possesses a truncated geometric steady state distribution (and
for $\rho = 1$ a uniform distribution). In this case, it is easy to compute the steady state mean queue
length.


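Based on the above and after some analytic calculations, we have that mean queue lengths for
all three systems are as follows:

$$L_{M/M/1} = \frac{\rho}{1-\rho}, \tag{10.25}$$

$$L_{M/D/1} = \frac{\rho}{1-\rho} \cdot \frac{2-\rho}{2}, \tag{10.26}$$

$$L_{M/M/1/K} = \frac{\rho}{1-\rho} \cdot \frac{1-(K+1)\rho^{K} + K\rho^{K+1}}{1-\rho^{K+1}}. \tag{10.27}$$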


The first is the mean queue length of an M/M/1 queue in steady state. The second refers to an
M/D/1 queue where the service times are deterministic. The third is for an M/M/1/K queue (finite
capacity). It may be interesting to compare (10.25) to both (10.26) and (10.27). From the formulas,
keeping in mind that $\rho < 1$, it isn't hard to see that each of $L_{M/D/1}$ and $L_{M/M/1/K}$ is lower
than $L_{M/M/1}$. Interestingly, when $\rho \approx 1$ (but smaller than 1), the M/D/1 case has queue lengths
that are on average approximately half as long as those of M/M/1.
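These comparisons are easy to explore numerically. The sketch below codes the three formulas as functions (the names `Lmm1`, `Lmd1` and `Lmm1k` are introduced here for illustration) and prints the ratios to the M/M/1 value for a few loads, with $K = 5$ as an assumed capacity.

```julia
# Steady state mean queue lengths as in (10.25), (10.26) and (10.27)
Lmm1(rho) = rho/(1 - rho)
Lmd1(rho) = rho/(1 - rho)*(2 - rho)/2
Lmm1k(rho, K) = rho/(1 - rho)*(1 - (K+1)*rho^K + K*rho^(K+1))/(1 - rho^(K+1))

for rho in [0.5, 0.9, 0.99]
    println("rho = ", rho,
            "\t M/D/1 ratio = ", Lmd1(rho)/Lmm1(rho),
            "\t M/M/1/5 ratio = ", Lmm1k(rho, 5)/Lmm1(rho))
end
```

The M/D/1 ratio equals $(2-\rho)/2$ and hence approaches 1/2 as $\rho$ approaches 1, in agreement with the observation above. As $K$ grows, the M/M/1/K value approaches the M/M/1 value.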


We now compare these theoretical formulas and observations to averages obtained via discrete
event simulation. In Listing 10.10 we implement a function queueDES(), which performs discrete
event simulation for a finite or infinite capacity queue. The simulation considers these three queue
variants with $\rho \approx 0.63$, and the queue length estimates obtained for a long time horizon are shown
to closely match the analytic formulas of (10.25), (10.26) and (10.27).
Listing 10.10: Discrete event simulation of queues

using Distributions, Random
Random.seed!(1)

function queueDES(T, arrF, serF, capacity = Inf, initQ = 0)
    t, q, qL = 0.0, initQ, 0.0

    nextArr, nextSer = arrF(), q == 0 ? Inf : serF()
    while t < T
        tPrev, qPrev = t, q
        if nextSer < nextArr              # next event is a service completion
            t = nextSer
            q -= 1
            if q > 0
                nextSer = t + serF()
            else
                nextSer = Inf
            end
        else                              # next event is an arrival
            t = nextArr
            if q == 0
                nextSer = t + serF()
            end
            if q < capacity
                q += 1
            end
            nextArr = t + arrF()
        end
        qL += (t - tPrev)*qPrev           # accumulate the area under the queue length path
    end
    return qL/t
end

lam, mu, K = 0.82, 1.3, 5
rho = lam/mu
T = 10^6

mm1Theor = rho/(1-rho)
md1Theor = rho/(1-rho)*(2-rho)/2
mm1kTheor = rho/(1-rho)*(1-(K+1)*rho^K+K*rho^(K+1))/(1-rho^(K+1))

mm1Est = queueDES(T, ()->rand(Exponential(1/lam)),
                     ()->rand(Exponential(1/mu)))
md1Est = queueDES(T, ()->rand(Exponential(1/lam)),
                     ()->1/mu)
mm1kEst = queueDES(T, ()->rand(Exponential(1/lam)),
                      ()->rand(Exponential(1/mu)), K)

println("The load on the system: ", rho)
println("Queueing theory: ", (mm1Theor,md1Theor,mm1kTheor) )
println("Via simulation: ", (mm1Est,md1Est,mm1kEst) )

The load on the system: 0.6307692307692307
Queueing theory: (1.7083333333333333, 1.169551282051282, 1.3050346932359453)
Via simulation: (1.7134526994574817, 1.1630297930829645, 1.302018728470463)
In lines 4-31 we implement the function queueDES(), which carries out a discrete event simulation
of a queue for up to T time units. The arguments arrF and serF are functions that present queueDES()
with the next inter-arrival time and next service time respectively. The argument capacity (with
default value Inf) sets a limit on the queue length (as needed for the M/M/1/K model). The argument
initQ (with default value 0) is the initial queue length. In line 5 the initial time and queue length
are set, along with the variable qL which is used later to calculate the average queue length. It is
essentially a running Riemann sum, i.e. the sum of products of the time between consecutive events and the
length of the queue in between these events, as calculated in line 28. The main simulation loop is in
lines 8-29. If the next service completion occurs before the next arrival, the queue is decremented by one,
and the next service time is updated. If the next arrival occurs before the next service completion, the queue
is increased by one (as long as the queue is not at capacity) and the next arrival time is updated.
Regardless of which occurs, qL is updated in line 28. This process continues until the time exceeds T.
In line 30 the average queue length for the simulation is calculated and returned. In lines 33-35 the
parameters of our three different queues are set, along with the maximum time units to be simulated
T, and in lines 37-39 the analytic solutions of the three queues are calculated as per (10.25), (10.26),
and (10.27). In lines 41-46 the three queues are simulated via queueDES(), and the numerically
estimated mean queue lengths are printed alongside their analytic counterparts in lines 48-50.











Waiting Times in Queues 


The previous example of discrete event simulation maintained the state of the queueing system
only via the number of items in the queue and the scheduled events. However, in some situations we
need to maintain a more detailed state representation. For example, instead of just keeping track
of ‘how many customers’ are in the system we may want to keep ‘individual information about each
customer’. We now consider such a case with an example of waiting times in an M/M/1 queue
operating under a first come first served policy. We have already touched on such a case in Listing 3.6
of Chapter 3, and in that example we implicitly used the formula,

$$P(W \le x) = 1 - \rho\, e^{-(\mu-\lambda)x}, \qquad \text{for } x \ge 0, \tag{10.28}$$


where $W$ is a random variable representing the waiting time of a customer arriving to a system in
steady state. Observe that for $x = 0$, $P(W \le 0) = P(W = 0) = 1 - \rho$. That is, the probability of
not waiting at all is $1 - \rho$. This is in agreement with the steady state distribution (10.22), since by
setting $k = 0$ in that equation we see that in steady state, the system is empty a fraction
$1 - \rho$ of the time. Observe that $W$ is a random variable that is neither purely discrete nor purely
continuous. There is a ‘mass’ at $x = 0$ and then for $x > 0$ it is continuous.
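As a quick numerical look at (10.28), the sketch below evaluates the mass at zero and approximates the mean waiting time by integrating the tail $P(W > x)$ with a Riemann sum. The rates $\lambda = 0.8$ and $\mu = 1$, the grid step and the upper limit of integration are assumed illustrative values.

```julia
lambda, mu = 0.8, 1.0                         # assumed rates, so that rho = 0.8
rho = lambda/mu
F(x) = 1 - rho*exp(-(mu - lambda)*x)          # the CDF (10.28)

dx = 0.001
xGrid = 0:dx:200
meanWait = sum(1 - F(x) for x in xGrid)*dx    # Riemann sum approximating the integral of P(W > x)

println("P(W = 0) = ", F(0.0))
println("Numerical mean wait = ", meanWait)
println("rho/(mu - lambda)   = ", rho/(mu - lambda))
```

Integrating the tail $\rho e^{-(\mu-\lambda)x}$ analytically gives the mean waiting time $\rho/(\mu-\lambda)$, which the Riemann sum reproduces up to discretization error.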


To get a feeling for the mathematical nature of queueing theory, we now present a derivation for
(10.28). It is obtained by considering the random variable $X$, representing the number of customers
in the system in steady state. As in (10.22), $X$ has a geometric distribution. Now, by conditioning
on the values of $X$ we are able to use the law of total probability to derive the complement of
(10.28) for strictly positive values of $x$.


Figure 10.9: The CDF of the waiting time distribution
in an M/M/1 queue with $\rho = 0.8$.

In the second step of the derivation we assume that for $k = 1,2,\ldots$ customers, the waiting time
of the arriving customer is distributed as the sum of $k$ independent exponential random variables,
each with mean $\mu^{-1}$. This is the density $f(u)$ which is a gamma (called Erlang) distribution. The
remainder of the calculation is slightly detailed, but straightforward. Here are the details:

$$P(W > x) = \sum_{k=1}^{\infty} P(W > x \mid X = k)\, P(X = k).$$
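The calculation can be completed along the following lines. This is a reconstruction under the assumptions stated in the text (a geometric $X$ as in (10.22), an Erlang waiting time given $X = k$, and $\lambda = \rho\mu$), and is not necessarily the exact sequence of steps intended by the authors. The exchange of the sum and the integral is justified since all terms are non-negative.

```latex
\begin{align*}
P(W > x) &= \sum_{k=1}^{\infty} P(W > x \mid X = k)\, P(X = k) \\
&= \sum_{k=1}^{\infty} \left( \int_x^{\infty} \frac{\mu^k u^{k-1}}{(k-1)!}\, e^{-\mu u}\, du \right) (1-\rho)\rho^k \\
&= (1-\rho)\rho\mu \int_x^{\infty} e^{-\mu u} \sum_{k=1}^{\infty} \frac{(\rho\mu u)^{k-1}}{(k-1)!}\, du \\
&= (1-\rho)\rho\mu \int_x^{\infty} e^{-(\mu-\lambda)u}\, du \\
&= \frac{(1-\rho)\rho\mu}{\mu-\lambda}\, e^{-(\mu-\lambda)x} \\
&= \rho\, e^{-(\mu-\lambda)x}.
\end{align*}
```

The fourth line uses $\sum_{k \ge 1} (\rho\mu u)^{k-1}/(k-1)! = e^{\lambda u}$, and the last line uses $(1-\rho)\mu = \mu - \lambda$. Taking the complement gives (10.28).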
The M/M/1 queue is one of a few special cases where we are able to use such probabilistic analysis to
obtain an explicit formula for the distribution of the waiting time. However in stochastic modelling,
if we modify the system even slightly, it is often the case that such an explicit performance measure
is hard to come by, and hence discrete event simulation is often used. Sticking with M/M/1 so we
can compare analytic and simulated solutions, in Listing 10.11 we carry out a simulation for the
M/M/1 queue. A comparison between the ECDF obtained from the simulation and the analytic
CDF is shown in Figure 10.9. It can be observed that there isn’t a perfect match because
we use a short time horizon in the simulation. You may modify the code by increasing T in line 41,
to observe a tighter fit.
Listing 10.11: Discrete event simulation for M/M/1 waiting times

using DataStructures, Distributions, StatsBase, Random, Plots, LaTeXStrings; pyplot()

function simMM1Wait(lambda,mu,T)
    tNextArr = rand(Exponential(1/lambda))
    tNextDep = Inf
    t = tNextArr

    waitingRoom = Queue{Float64}()             # arrival times of waiting customers
    serverBusy = false
    waitTimes = Array{Float64,1}()

    while t<T
        if t == tNextArr                       # arrival event
            if !serverBusy
                tNextDep = t + rand(Exponential(1/mu))
                serverBusy = true
                push!(waitTimes,0.0)
            else
                enqueue!(waitingRoom,t)
            end
            tNextArr = t + rand(Exponential(1/lambda))
        else                                   # departure event
            if length(waitingRoom) == 0
                tNextDep = Inf
                serverBusy = false
            else
                tArr = dequeue!(waitingRoom)
                waitTime = t - tArr
                push!(waitTimes,waitTime)
                tNextDep = t + rand(Exponential(1/mu))
            end
        end
        t = min(tNextArr,tNextDep)
    end

    return waitTimes
end

Random.seed!(1)
lambda, mu = 0.8, 1.0
T = 10^3

data = simMM1Wait(lambda,mu,T)
empiricalCDF = ecdf(data)

F(x) = 1-(lambda/mu)*MathConstants.e^(-(mu-lambda)x)
xGrid = 0:0.1:20

plot(xGrid, F.(xGrid),
    c=:blue, label="Analytic CDF of waiting time")
plot!(xGrid, empiricalCDF(xGrid),
    c=:red, label="ECDF of waiting times",
    xlabel=L"x", ylabel=L"P(W \leq x)", xlims=(0,20), ylims=(0,1),
    legend=:bottomright)
In lines 3-37 we define the main function used in this simulation, simMM1Wait(). This
function returns a sequence of waitTimes for consecutive customers departing from the queue
simulated for a time horizon T. The simMM1Wait() function uses a Queue data structure from the
DataStructures package. The waitingRoom variable, defined in line 8, represents the waiting
room of customers, and its elements represent the arrival times of customers. The main simulation loop
is in lines 12-34. Lines 14-21 handle an ‘arrival’ event, while lines 23-31 handle a ‘departure’ event.
In line 19, when new arrivals to the busy server occur, new elements are added to waitingRoom via
the enqueue!() function. In line 23, length() is applied to waitingRoom to see if the queue is
empty. If it is empty, then lines 24-25 set the state of the system as ‘idle’ by setting the next
departure time, tNextDep, to Inf and serverBusy to false. On the other hand, in lines 27-30 a new
customer is pulled from the waiting room via dequeue!() while line 28 calculates the waitTime
that the customer has experienced. In line 29 that waiting time is pushed to waitTimes. In line 30
the service duration of that customer is randomly generated and tNextDep is set. Lines 39-41 set
the parameters. Lines 43-44 execute the simulation and compute the ECDF via ecdf(). Line 46
implements (10.28) as F(). The remainder of the code generates Figure 10.9.





10.4 Models with Additive Noise 


In Section 10.1 we considered deterministic models. We then followed with inherently random
models, including Markov chains and discrete event simulation. We now look at a third class of
models. These are based on deterministic models that have been modified to incorporate
randomness. A basic mechanism for creating such models is to take a system equation such as (10.2), and
augment it with a noise component in an additive form. Denoting the noise by $\xi(t)$ we obtain,

$$X(t+1) = f\big(X(t)\big) + \xi(t). \tag{10.29}$$

A similar type of modification can be done to continuous time systems, yielding stochastic differential
equations. However our focus here will be on the discrete case.


As an illustrative example, we revisit the predator-prey model explored in Listing 10.1. For this
example we add i.i.d. random variables with zero mean and a standard deviation of 0.02 to the prey
population. This is done in Listing 10.12 and the resulting stochastic trajectory is plotted alongside
the previously calculated deterministic trajectory as shown in Figure 10.10. Note that this listing
is very similar to Listing 10.1 with the main difference being in line 21 where the addition of the
noise vector [rand(Normal(0,sig)),0.0] applies normally distributed disturbances to the prey
and no explicit disturbances to the predator population.


By adding such a noise component one may generate multiple trajectories of X(t), each for a
different point $\omega$ in the probability sample space. Then a mean trajectory may be estimated by
considering an ensemble average over all the generated trajectories. Similarly, variability estimates
and confidence bands can be obtained. Note that contrary to what some practitioners wrongly
believe, even if the noise is zero mean, it does not generally hold that the expected value of X(t)
of (10.29) equals X(t) of (10.2). That is, the nature of the noise modifies the expected trajectory
even though at every step, the noise is on average zero.
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Initial state 
Deterministic trajectory 
Stochastic trajectory 
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Figure 10.10: Trajectory of a stochastic predator prey model together with a 
deterministic model. 
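The ensemble averaging idea described above can be sketched as follows. This is a minimal illustration rather than the book's own code: the parameter values a, c, d are assumed to match the predator-prey example, the helper name stochTraj() and the number of trajectories are hypothetical, and only the standard libraries Random and Statistics are used.

```julia
using Random, Statistics

# Parameter values are assumptions made for this illustration.
a, c, d = 2, 1, 5
sig = 0.02
next(x, y) = [a*x*(1-x) - x*y, -c*y + d*x*y]

# One stochastic trajectory of length T: zero mean normal noise on the prey only.
function stochTraj(rng, initX, T)
    traj = [copy(initX)]
    for t in 2:T
        push!(traj, next(traj[end]...) + [sig*randn(rng), 0.0])
    end
    traj
end

rng = MersenneTwister(1)
initX, T, numTraj = [0.8, 0.05], 10, 1000
ensemble = [stochTraj(rng, initX, T) for _ in 1:numTraj]

# Ensemble mean and standard deviation of the prey population at each time.
preyMean = [mean(tr[t][1] for tr in ensemble) for t in 1:T]
preyStd  = [std([tr[t][1] for tr in ensemble]) for t in 1:T]
```

Plotting preyMean together with bands such as preyMean .± 2*preyStd gives one simple form of the variability estimates mentioned above.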


Listing 10.12: Trajectory of a predator prey model with noise


using Distributions, Random, Plots, LaTeXStrings; pyplot()
Random.seed!(1)

a, c, d = 2, 1, 5                    # parameter values assumed to match Listing 10.1
sig = 0.02                           # standard deviation of the noise on the prey
next(x,y) = [a*x*(1-x) - x*y, -c*y + d*x*y]
equibPoint = [(1+c)/d , (d*(a-1)-a*(1+c))/d]

initX = [0.8,0.05]
tEnd, tEndStoch = 100, 10

traj = [[] for _ in 1:tEnd]
trajStoch = [[] for _ in 1:tEndStoch]
traj[1], trajStoch[1] = initX, initX

for t in 2:tEnd
    traj[t] = next(traj[t-1]...)
end

for t in 2:tEndStoch
    trajStoch[t] = next(trajStoch[t-1]...) + [rand(Normal(0,sig)),0.0]
end

scatter([traj[1][1]], [traj[1][2]],
    c=:black, ms=10, label="Initial state")
plot!(first.(traj), last.(traj),
    c=:blue, ls=:dash, m=(:dot, 5, Plots.stroke(0)),
    label="Deterministic trajectory")
plot!(first.(trajStoch), last.(trajStoch),
    c=:green, ls=:dash, m=(:dot, 5, Plots.stroke(0)),
    label="Stochastic trajectory")
scatter!([equibPoint[1]], [equibPoint[2]],
    c=:red, shape=:cross, ms=10, label="Equilibrium point",
    xlabel=L"X_1", ylabel=L"X_2")


State Tracking in Linear Systems


Many physical systems can be modeled by the evolution, 


$$
\begin{aligned}
X(t+1) &= AX(t) + \xi(t), \\
Y(t) &= CX(t) + \zeta(t).
\end{aligned}
\tag{10.30}
$$


Here, X(t) and Y(t) are the state and observation vectors respectively, while ξ(t) and ζ(t) are state
and observation disturbances respectively, and are often described by independent sequences of 
i.i.d. random variables. Such models are often used in control theory and linear system theory. See 
for an overview description of control theory and for a comprehensive introduction. 


The matrix A describes the state evolution in a similar manner to the spring-mass example in 
(10.7), while the matrix C maps the current state to the measurement vector (prior to the addition 
of noise). That is, for such a system, the sensors’ measurements are represented via Y (t). In general, 
such systems are called linear systems with additive noise. One desire in such systems is to use the 
sensor measurements, Y (0), Y (1), Y (2),..., to estimate (or track) the state as time progresses and 
the system is running. 


Even if the number of sensors (dimension of Y(t)) is much smaller than the number of state
variables (dimension of X(t)) we can often track the state X(t) effectively. Furthermore, as we
show below using Kalman filtering, we may even do so in the presence of the disturbance ξ(t) and
measurement noise ζ(t). To this end first assume that ξ(t) and ζ(t) are both 0 vectors, i.e. there
isn’t any noise. In this case, the Luenberger observer is a state estimate X̂(t) which is parameterized
by the gain matrix K, and operates as follows:

$$\hat{X}(t+1) = A\hat{X}(t) - K\big(\hat{Y}(t) - Y(t)\big), \qquad \text{with} \qquad \hat{Y}(t) = C\hat{X}(t). \tag{10.31}$$

Here, at time t = 0, X̂(0) is arbitrarily initialized, and then based on the observations Y(0), Y(1), ...
the state estimate is iterated as follows:

$$
\begin{aligned}
\hat{X}(t+1) &= A\hat{X}(t) - K\big(C\hat{X}(t) - Y(t)\big) \\
&= (A - KC)\hat{X}(t) + KY(t).
\end{aligned}
$$

In this case if we consider the estimation error, e(t) = X(t) - X̂(t), then we can show that,

$$
\begin{aligned}
e(t+1) &= X(t+1) - \hat{X}(t+1) \\
&= AX(t) - \Big(A\hat{X}(t) - K\big(\hat{Y}(t) - Y(t)\big)\Big) \\
&= A\big(X(t) - \hat{X}(t)\big) - K\big(Y(t) - \hat{Y}(t)\big) \\
&= A\big(X(t) - \hat{X}(t)\big) - K\big(CX(t) - C\hat{X}(t)\big) \\
&= Ae(t) - KCe(t) \\
&= (A - KC)\,e(t).
\end{aligned}
\tag{10.32}
$$


Hence if we can design (or choose) a gain matrix K such that A - KC is a stable matrix
(all eigenvalues are within the unit circle), then the Luenberger observer will have e(t) → 0
as t → ∞. Remarkably, it turns out that if the pair A and C satisfy a rank condition called
observability then we can always find such a matrix K, and hence always design a Luenberger
observer to have asymptotically perfect tracking. If A is an n × n matrix and C is a p × n matrix,
then the observability matrix is the np × n matrix,

$$O = \begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^{n-1} \end{bmatrix}.$$


The system is said to be observable if the matrix is full rank, i.e. in this case it requires O to have 
linearly independent columns. See for more details. 
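The rank condition can be checked numerically. The sketch below is an illustration under assumptions: the function names observabilityMatrix() and isObservable() and the small 2-state example are hypothetical and are not part of the book's listings.

```julia
using LinearAlgebra

# Stack C, CA, CA^2, ..., CA^(n-1) into the np x n observability matrix.
function observabilityMatrix(A, C)
    n = size(A, 1)
    vcat([C * A^k for k in 0:n-1]...)
end

# Observable if the observability matrix has n linearly independent columns.
isObservable(A, C) = rank(observabilityMatrix(A, C)) == size(A, 1)

# Hypothetical 2-state example in which a single scalar measurement is taken.
A = [1.0 1.0; 0.0 1.0]
C = [1.0 0.0]

isObservable(A, C)            # true:  O = [1 0; 1 1] has rank 2
isObservable(A, [0.0 1.0])    # false: O = [0 1; 0 1] has rank 1
```

Measuring only the first state is enough here because the second state feeds into the first through A, whereas measuring only the second state reveals nothing about the first.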


To appreciate the potential strength of the Luenberger observer, imagine a complex system
where the state X(t) is high dimensional, say 100, but the observations vector is much smaller, say
only 3 dimensional (the matrix C would be 3 × 100). In such a system, subject to the technical
observability condition on A and C, we can design a Luenberger observer with matrix K such that
after the system runs for a while, we have a near perfect representation of X(t) via our X̂(t). This is
only based on 3 dimensional measurements Y(t) at each time point! See for further details.
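To see the error dynamics (10.32) at work, here is a small noise-free simulation sketch. The system matrices, the gain K, and the function name trackingErrors() are assumptions chosen for illustration; K was picked by hand so that both eigenvalues of A - KC equal 0.5.

```julia
using LinearAlgebra

A = [1.0 1.0; 0.0 1.0]
C = [1.0 0.0]
K = reshape([1.0, 0.25], 2, 1)    # A - K*C = [0 1; -0.25 1] has a double eigenvalue at 0.5

# Run the system and the observer side by side and record the norm of e(t).
function trackingErrors(A, C, K, X0, Xhat0, T)
    X, Xhat = copy(X0), copy(Xhat0)
    errs = [norm(X - Xhat)]
    for t in 1:T
        Y = C * X                               # noise-free measurement
        Xhat = A * Xhat - K * (C * Xhat - Y)    # observer update as in (10.31)
        X = A * X                               # noise-free state evolution
        push!(errs, norm(X - Xhat))
    end
    errs
end

errs = trackingErrors(A, C, K, [1.0, 0.5], [0.0, 0.0], 60)
```

Even though the state itself keeps growing in this example, the recorded errors decay geometrically, in line with e(t+1) = (A - KC)e(t).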


Now we allow noise ξ(t) and ζ(t) and present Kalman filtering. Here we wish to find an optimal
gain matrix K that will also take the statistical characteristics of the disturbance vectors ξ(t) and
ζ(t) into consideration. We do this based on the Linear Minimum Mean Square Error (LMMSE).
Using the notation || · || for the L2 norm, we try to set X̂(t) to be a linear function of the observed
values which minimizes,

$$\sum_{t=1}^{T} E\big[\,\|X(t) - \hat{X}(t)\|^2\,\big], \tag{10.33}$$

for some time horizon T, or (often more practically), for the infinite horizon time average,

$$\lim_{T \to \infty} \frac{1}{T}\sum_{t=1}^{T} E\big[\,\|X(t) - \hat{X}(t)\|^2\,\big].$$

The latter also generally equals the steady state expected mean squared error,

$$\mathcal{E}_{ss} = \lim_{t \to \infty} E\big[\,\|X(t) - \hat{X}(t)\|^2\,\big]. \tag{10.34}$$
For these cases, the Kalman filter is an algorithm that computes the gain matrix K (or sequence
of matrices in the case of a finite horizon) which, if used in a Luenberger observer (10.31), yields an
LMMSE solution. That is, a Kalman filter is a way to find a good gain matrix K.


Note that if the disturbances are assumed to be Gaussian then the LMMSE solution is also a
Minimum Mean Square Error (MMSE) solution. That is, with Gaussian noise the Kalman filter is
optimal in the MSE sense, while with non-Gaussian noise it is optimal only within the class of linear
estimators.


We skip the full details of implementing a Kalman filter. Nevertheless, we mention that a
sequence of gain matrices for minimizing (10.33) can be computed recursively via the Kalman
filtering algorithm or alternatively, the steady state Kalman gain K for minimizing (10.34) can be
computed by solving a Riccati equation which considers the system matrices A and C, as well as
the covariance matrices of ξ(t) and ζ(t). If you are interested in the full details
of Kalman filtering and ways to compute gain matrices, refer to [AM07]. We now present a simple
scalar example similar to example 10.26 from [LGO08].


Figure 10.11: Left: Trajectory of a linear system with noise tracked by a
Kalman filter. At time t = 50 the system is disturbed and it takes a few time
epochs for the Kalman filter to catch up. Right: a plot of the steady state
MSE as a function of gain. The optimal gain is at k = 0.3.


A Scalar Example of Kalman Filtering 


For this example, we construct a model based on (10.30), where all the variables are scalar,

$$
\begin{aligned}
X(t+1) &= aX(t) + \xi(t), \\
Y(t) &= X(t) + \zeta(t).
\end{aligned}
\tag{10.35}
$$

We assume a ∈ (0,1) and hence this model describes a system that tends to revert towards 0 by
a factor of a at each time unit. Also assume that ξ(t) and ζ(t) are independent zero mean normal
random variables with variances $\sigma_\xi^2$ and $\sigma_\zeta^2$ respectively.


The process X(t) is sometimes called an autoregressive process of order 1, denoted AR(1) and
among other phenomena, it can be used to describe the temperature of a system where ‘0’ is
taken as the reference point temperature. If undisturbed, the temperature X(t) quickly converges
to 0. However, since it is subject to temperature disturbances ξ(t), there are fluctuations in the
temperature.


The measurement is imprecise as there are measurement disturbances present, ζ(t), and the
measured temperature Y(t) deviates from the actual temperature X(t). Our goal is then to estimate
the current temperature at time t based on the measurement history, Y(0), Y(1), ..., Y(t - 1).
Following (10.31), the state estimate evolution follows,

$$\hat{X}(t+1) = a\hat{X}(t) - k\big(\hat{X}(t) - Y(t)\big), \tag{10.36}$$

where X̂(0) is some initial value.


Momentarily ignoring the noise components, consider the error dynamics (10.32) with A = a,
K = k (scalar gain parameter), and C = 1. We have that the matrix A - KC in (10.32) is simply
the scalar a - k and hence if there isn’t any noise, we expect e(t) → 0 as long as |a - k| < 1. For
example we can set k = a and just get X̂(t+1) = aY(t) from (10.36), as one would naively do
without thinking about filtering measurements. However, in the presence of noise the error e(t)
will continue to fluctuate and hence there may be better choices of k that also take the variance
parameters $\sigma_\xi^2$ and $\sigma_\zeta^2$ into consideration.


Observe the left plot of Figure 10.11 for a system with a = 0.8, $\sigma_\xi^2 = 0.36$, and $\sigma_\zeta^2 = 1.0$. Here
X(t) is plotted in blue after starting at time t = 0 at a value of 10 and suffering an exogenous
disturbance at time t = 50 to a level of -20 (this is a disturbance that isn’t part of the model).
The measurements Y(t) are in black, and the trajectory of X̂(t) using a gain k = 0.3 is in red. The
Kalman filter specifies this value of k, as we discuss below. You can visually observe that filtering
generally does a better job in tracking the signal than just using Y(t) as an estimate.


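What plays a role in finding the optimal gain k? To get some sense of how this can be done,
recall the computation in (10.32) and repeat it for (10.35) (again using the Luenberger observer
(10.31)),

$$
\begin{aligned}
e(t+1) &= X(t+1) - \hat{X}(t+1) \\
&= aX(t) + \xi(t) - \Big(a\hat{X}(t) - k\big(\hat{X}(t) - X(t) - \zeta(t)\big)\Big) \\
&= a\big(X(t) - \hat{X}(t)\big) - k\big(X(t) - \hat{X}(t)\big) + \xi(t) - k\zeta(t) \\
&= a\,e(t) - k\,e(t) + \xi(t) - k\zeta(t) \\
&= (a-k)\,e(t) + \xi(t) - k\zeta(t).
\end{aligned}
\tag{10.37}
$$

Since ξ(t) and ζ(t) are independent of each other and of e(t), taking the variance of both sides of
the equation we obtain,

$$\mathrm{Var}\big(e(t+1)\big) = (a-k)^2\,\mathrm{Var}\big(e(t)\big) + \sigma_\xi^2 + k^2\sigma_\zeta^2. \tag{10.38}$$

Now observe that the steady state MSE, $\mathcal{E}_{ss}$ of (10.34), equals $\lim_{t \to \infty} \mathrm{Var}(e(t))$. Hence by taking
t → ∞ on (10.38), we have,

$$\mathcal{E}_{ss} = (a-k)^2\,\mathcal{E}_{ss} + \sigma_\xi^2 + k^2\sigma_\zeta^2, \tag{10.39}$$

which after rearranging becomes,

$$\mathcal{E}_{ss} = \frac{\sigma_\xi^2 + k^2\sigma_\zeta^2}{1-(a-k)^2}. \tag{10.40}$$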


This illustrates how the steady state MSE depends on k. One can then minimize $\mathcal{E}_{ss}$, which in the case of
our parameter settings is minimized at k = 0.3. Kalman filtering yields a way of carrying out this
minimization in an efficient and systematic manner, even for much more complex systems. A plot
of (10.40) is the red curve in the right hand plot of Figure 10.11. We also try different
values of k via Monte Carlo to check the validity of (10.40). These are the scattered points around the
curve, each obtained from a simulation run of T = 10^6 time units where the MSE is estimated after
throwing away the first 10^4 observations for ‘warm up’.
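The value k = 0.3 can be recovered in two ways, sketched below. The first is a plain grid search over the formula (10.40). The second iterates the scalar form of the Riccati recursion for the one-step prediction error variance P, which is the standard textbook form and is stated here as an assumption since this section does not spell it out; the function name riccatiGain() is hypothetical.

```julia
a, varXi, varZeta = 0.8, 0.36, 1.0

# Steady state MSE as a function of the gain, as in (10.40).
analyticErr(k) = (varXi + k^2*varZeta) / (1 - (a-k)^2)

# Grid search over gains satisfying |a - k| < 1.
kGrid = 0.01:0.001:0.99
kBest = kGrid[argmin(analyticErr.(kGrid))]

# Fixed point iteration of the scalar Riccati recursion, followed by the
# corresponding steady state gain k = a*P/(P + varZeta).
function riccatiGain(a, varXi, varZeta; iters = 200)
    P = varXi
    for _ in 1:iters
        P = a^2 * P * varZeta / (P + varZeta) + varXi
    end
    P, a * P / (P + varZeta)
end

P, kRiccati = riccatiGain(a, varXi, varZeta)
```

For these parameters both approaches agree: kBest and kRiccati are approximately 0.3, and both P and analyticErr(0.3) are approximately 0.6, the minimum of the red curve in the right hand plot.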


Listing 10.13 creates Figure 10.11 with the main function, luenbergerTrack() used both for
creating the short horizon trajectory on the left plot, and the long term Monte Carlo estimates on
the right plot.


Listing 10.13: Kalman filtering


using Distributions, LinearAlgebra, Random, Measures, Plots; pyplot()
Random.seed!(1)

a, varXi, varZeta = 0.8, 0.36, 1.0
X0, spikeTime, spikeSize = 10.0, 50, -20.0
Tsmall, Tlarge, warmTime = 100, 10^6, 10^4
kKalman = 0.3

function luenbergerTrack(k, T, spikeTime = Inf)
    X, Xhat = X0, 0.0
    xTraj, xHatTraj, yTraj = [X], [Xhat,Xhat], [X0]

    for t in 1:T-1
        X = a*X + rand(Normal(0,sqrt(varXi)))        # state evolution
        Y = X + rand(Normal(0,sqrt(varZeta)))        # noisy measurement
        Xhat = a*Xhat - k*(Xhat - Y)                 # observer update

        push!(xTraj, X)
        push!(xHatTraj, Xhat)
        push!(yTraj, Y)

        if t == spikeTime
            X += spikeSize
        end
    end
    deleteat!(xHatTraj, length(xHatTraj))
    xTraj, xHatTraj, yTraj
end

smallTraj, smallHat, smallY = luenbergerTrack(kKalman, Tsmall, spikeTime)

p1 = scatter(smallTraj, c=:blue,
        ms=3, msw=0, label="System trajectory")
p1 = scatter!(smallY, c = :black,
        ms=3, msw=0, label="Measurements")
p1 = scatter!(smallHat, c=:red,
        ms=3, msw=0, label="Kalman filter tracking",
        xlabel = "Time", ylabel = "Temperature",
        xlims=(0, Tsmall))

kRange = 0.2:0.005:0.4
errs = []
for k in kRange
    xTraj, xHatTraj, _ = luenbergerTrack(k, Tlarge)
    mse = norm(xTraj[warmTime:end] - xHatTraj[warmTime:end])^2/(Tlarge-warmTime)
    push!(errs, mse)
end

analyticErr(k) = (varXi + k^2*varZeta) / (1-(a-k)^2)

p2 = scatter(kRange, errs, c=:black, ms=3, msw=0,
        xlabel="k", ylabel="MSE", label = "Monte Carlo")
p2 = plot!(kRange, analyticErr.(kRange), c = :red,
        xlabel="k", ylabel="MSE", label = "Analytic", ylim =(0.58,0.64))

plot(p1, p2, size=(1000,400), margin = 5mm)


In lines 4-7 we define system, filtering, and simulation parameters. In lines 9-28 we define the
luenbergerTrack() function for simulating a trajectory of time horizon T with gain k. The
parameter spikeTime is used for introducing an exogenous spike, however with the default value
Inf, such a spike doesn’t occur. Lines 14-15 directly implement (10.35) and line 16 implements the
filter (10.36). Line 26 uses deleteat!() to remove the last observation from xHatTraj. This
agrees with the fact that it is initialized with two values in line 11. This is because it is a prediction of
the next state X at every time. In line 30 we create the trajectory used for the left plot. It uses the
optimal gain, kKalman, specified earlier. Then lines 32-39 are used to create the left plot. Lines 41-47
empirically try a sequence of gain values over the range kRange. For each, a long term simulation
is executed. Observe the use of norm() from LinearAlgebra for estimating mse. Line 49 defines
analyticErr() that directly implements (10.40). The remainder of the code creates the right hand
plot and combines the plots into Figure 10.11.








10.5 Network Reliability 


We now briefly touch on the field of network reliability via simple examples. This discipline 
deals with the analysis of the reliability of systems composed of interconnected components. See 
for example for an introduction. Examples of systems that can be analyzed via network 
reliability models are road networks, electric power grids, computer networks, and other systems 
which can be described with the aid of (combinatorial) graphs. A graph is a collection of vertices 
and edges, where the edges describe connections between vertices. See for example Figure 10.12. As
a simple application assume the graph represents a road network, where the edges represent roads 
and the vertices represent towns. 


In the context of network reliability, after a graph is used to model relationships between components
of the network, a probabilistic model is imposed on the graph. With such a model, certain
edges or nodes are subject to failure/repair, sometimes in a dynamic manner. The reliability of 
the network is then some statistical summary of the probability model quantifying performance 
measures. 


As an example, consider the road network of Figure 10.12 and say we wish to have an active
path between towns A and D. For this example there are three possible paths. However, what
if the roads were subject to failure? In this case, a standard network reliability question may be:
what is the probability of connectivity between towns A and D? Say we use a simplistic probability
model which assumes that at a snapshot of time, each road is in a failed state with probability p,
independently of all other roads. Hence the reliability of the network as a function of p is,


r(p) = P(There is a path from A to D). 


In the case of our simplistic network depicted in Figure [10.12| we can actually compute r(p) 
analytically as follows, 





$$
\begin{aligned}
r(p) &= 1 - P(\text{There does not exist a path from } A \text{ to } D) \\
&= 1 - P(A \to B \to D \text{ is broken})\, P(A \to D \text{ is broken})\, P(A \to C \to D \text{ is broken}) \\
&= 1 - \big(1-(1-p)^2\big)\, p\, \big(1-(1-p)^2\big) \\
&= 1 - p^3(p-2)^2.
\end{aligned}
$$

Figure 10.12: A graph with vertices (A, B, C, D) and edges
1 = (A,B), 2 = (A,C), 3 = (A,D), 4 = (B,D), 5 = (C,D).


The key in the above computation is the fact that each path does not share edges with any other
path. For example the path A → B → D and the path A → C → D don’t intersect on any edges.
This allows us to move from the first line of the derivation to the second line. Afterwards, individual
components can be calculated via

$$P(A \to B \to D \text{ is broken}) = 1 - P(A \to B \text{ is not broken})\,P(B \to D \text{ is not broken}) = 1 - (1-p)^2,$$

and similarly for P(A → C → D is broken). Hence in such a simple example we can derive an
analytic expression for the reliability of this network. However, for more complicated and interesting
networks, this is not typically possible. This is because as redundancy emerges, strong dependencies
exist between paths that share edges. A straightforward alternative approach to evaluate the
reliability of the network is to use brute-force for generating many replications of random instances
of the network, verifying if a path exists for each, and estimating the proportion from the Monte
Carlo simulations.
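For a network this small, the analytic expression can also be verified exactly, without simulation, by enumerating all 2^5 edge states. The sketch below does this; the function names connected() and exactReliability() are hypothetical, and the vertex numbering A=1, B=2, C=3, D=4 is an assumption made for the illustration.

```julia
using LinearAlgebra

# True if vertices s and d are joined by a path that uses only the edges in `present`.
function connected(present, nVertices, s, d)
    R = zeros(Int, nVertices, nVertices)
    for (i, j) in present
        R[i, j] = R[j, i] = 1
    end
    ((I + R)^nVertices)[s, d] > 0
end

# Exact reliability: add up the probabilities of all edge states that keep s and d connected.
function exactReliability(p, edgeList, nVertices, s, d)
    m = length(edgeList)
    total = 0.0
    for mask in 0:(2^m - 1)
        up = [edgeList[e] for e in 1:m if (mask >> (e - 1)) & 1 == 1]
        prob = (1 - p)^length(up) * p^(m - length(up))
        if connected(up, nVertices, s, d)
            total += prob
        end
    end
    total
end

edgeList = [(1,2), (1,3), (2,4), (1,4), (3,4)]    # A=1, B=2, C=3, D=4
closedForm(p) = 1 - p^3*(p - 2)^2

# Largest discrepancy between enumeration and the closed form over a grid of p values.
maxGap = maximum(abs(exactReliability(p, edgeList, 4, 1, 4) - closedForm(p)) for p in 0:0.1:1)
```

The enumeration grows as 2^m in the number of edges m, which is one reason Monte Carlo becomes the practical choice for larger networks.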


We carry out an example of this brute-force method via Monte Carlo simulation in Listing 10.14.
The estimates obtained are then compared with the solutions given by r(p) = 1 - p^3(p - 2)^2, and the
results plotted in Figure 10.13. Note that the functions defined in this code listing are not limited
to the simple network of Figure 10.12 but are applicable to other networks through straightforward
modifications of lines 5 and 7 by specifying a different edge list, source, and destination.


Graphs can be directed or undirected. Here we deal with undirected graphs, meaning that
edges between vertices don't have a specified direction. In both cases, a common way to represent
a graph is via an adjacency matrix. For a graph with L vertices, the L × L adjacency matrix R is
defined to have entries,

$$R_{ij} = \begin{cases} 1 & \text{if edge } i \to j \text{ is in the graph}, \\ 0 & \text{if edge } i \to j \text{ is not in the graph}. \end{cases} \tag{10.41}$$

Since we are dealing with undirected graphs, the matrix R is always symmetric. With a graph
represented via R, it turns out that for any integer ℓ ≥ 1, the i,j entry of the matrix power $R^\ell$ is
the number of paths of length ℓ (possibly revisiting vertices) from vertex i to vertex j. You can
verify this for ℓ = 2 via,

$$[R^2]_{ij} = \sum_{k=1}^{L} R_{ik} R_{kj}.$$
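The counting property can be checked on a small graph. The sketch below assumes the four-vertex road network with edges (A,B), (A,C), (A,D), (B,D), (C,D) and the numbering A=1, B=2, C=3, D=4; the helper walks2() is a hypothetical name for the sum in the formula above.

```julia
# Adjacency matrix of the assumed four-vertex network (A=1, B=2, C=3, D=4).
R = [0 1 1 1;
     1 0 0 1;
     1 0 0 1;
     1 1 1 0]
L = size(R, 1)

# Entry (i,j) of R^2 computed directly from the sum over intermediate vertices k.
walks2(i, j) = sum(R[i, k]*R[k, j] for k in 1:L)

R2 = R^2
R2[1, 4]    # 2: the two-step routes A-B-D and A-C-D
R2[1, 1]    # 3: the two-step round trips A-B-A, A-C-A and A-D-A
```

The diagonal entry shows that the count includes routes that return to a vertex already visited, which is harmless when the only question is whether some route exists.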


We can then use this elegant property of adjacency matrices and their powers in Listing 10.14 to
compute if a path exists between source and destination in a graph by checking for paths of length
ℓ = 1, ..., L. An alternative is to modify R by setting diagonal elements $R_{ii} = 1$. Then all that is
needed to see if there is a path from i to j is to consider the L’th power, $R^L$.


Figure 10.13: The reliability function of a simple network.


Listing 10.14: Simple network reliability


using LinearAlgebra, Random, Plots; pyplot()
Random.seed!(1)

N = 10^4
edges = [(1,2), (1,3), (2,4), (1,4), (3,4)]
L = maximum(maximum.(edges))
source, dest = 1, L

function adjMatrix(edges, L)
    R = zeros(Int, L, L)
    for e in edges
        R[ e[1], e[2] ], R[ e[2], e[1] ] = 1, 1      # symmetric: undirected graph
    end
    R
end

pathExists(R, source, destination) = sign.((I+R)^L)[source, destination]
randNet(p) = randsubseq(edges, 1-p)

relEst(p) = sum([pathExists(adjMatrix(randNet(p),L),source,dest) for _ in 1:N])/N
relAnalytic(p) = 1-p^3*(p-2)^2

pGrid = 0:0.05:1
scatter(pGrid, relEst.(pGrid),
    c=:blue, ms=5, msw=0, label="Monte Carlo")
plot!(pGrid, relAnalytic.(pGrid),
    c=:red, label="Analytic", xlims=(0,1.05), ylims=(0,1.05),
    xlabel="p", ylabel="Reliability")


Figure 10.14: The undirected graph used in Listing 10.15.





In lines 4-7 we define parameters including the list of edges via the edges array in accordance with
Figure 10.12. In this array, we represent vertex A from the figure by 1, vertex B by 2, and so forth.
The total number of vertices of this graph, L, is set by first broadcasting maximum() onto each tuple
and then taking the maximum(). In lines 9-15 we implement the adjMatrix() function, which
takes an array of pairs (edges) as input, and from those edges creates an adjacency matrix. In line 17
we implement the pathExists() function, which checks if there is a path between the source
and destination vertices in the graph represented by the adjacency matrix R. Prior to taking the
L'th matrix power, we augment the adjacency matrix by setting 1 for entries on the diagonal via the
addition of the identity matrix I. We then broadcast sign() to see if entries are 0 or greater than 0.
In line 18 we use randsubseq() from Random to implement the randNet() function. It retains
an edge from edges with probability 1-p and otherwise removes it in accordance with our reliability
model. This is carried out for each edge independently. In line 20 we implement the relEst()
function which composes randNet(), adjMatrix(), and pathExists() to check if there is a
path on a random instance of the network. This is repeated for N separate, independently simulated
networks via a comprehension and the Monte Carlo proportion estimate is returned. In line 21 the
analytic equation r(p) is defined as relAnalytic(). The remainder of the code executes
relEst() and relAnalytic() over a grid of probabilities to create Figure 10.13 which compares
the analytic solution with Monte Carlo based estimates.

















A Dynamic Reliability Example 


We now look at a dynamic reliability model. Instead of assuming a static setting where each
edge fails with probability p, we introduce a time component to the model making it dynamic. We
assume that at time 0 all edges of the network are operating (not broken) and that the lifetimes of
individual edges are i.i.d. exponentially distributed random variables, with parameter λ (assumed
the same for all edges for simplicity). Then as time progresses, edges fail one after the other based
on their lifetimes. At any given time, the network state X(t) is the collection of edges that are still
operating.


For a specific source and destination specification and given the network state X(t), we can
check at any time t if there is still a path between source and destination. We then define the
failure time, or network life time via,

$$\tau = \inf\{t \ge 0 : \text{There isn't a path between source and destination in } X(t)\}.$$

Figure 10.15: Comparison of the distribution of time until failure
for λ = 0.5 and λ = 1. (Horizontal axis: Network Life Time; vertical axis: Density.)


One important reliability function is then the expected network life time,

$$r(\lambda) = E[\tau].$$











Further, we may be interested in the distribution of the network lifetime as influenced by λ. In
general, we can expect that as λ increases, this distribution will be more concentrated near 0 and that
r(λ) will decrease. This is because with higher λ, individual edges tend to fail more quickly. Two
such distributions are shown in Figure 10.15, generated by Listing 10.15 for the example network in
Figure 10.14 with source being A and destination F.


Evaluating r(λ) or the distribution of τ analytically is typically not possible. Instead, we resort
to Monte Carlo simulation. With the i.i.d. assumption on edge life times, the network state, X(t)
is a well understood stochastic process as it can be described by a continuous time Markov chain
(CTMC) where at any given time t, X(t) denotes the set of operating edges. We can then use the
Doob-Gillespie algorithm first introduced in Listing 10.7 for such a network. In doing so we observe
that times between state changes are distributed exponentially, with a rate λ · E(X(t)), where
E(·) counts the number of edges in the network. For example, to begin with X(0) = {1, 2, ..., 10}
which is the full edge set of Figure 10.14 and the time until the first failure event is distributed
exponentially with parameter 10λ. After the first random edge fails, the time until the next failure
event is distributed exponentially with parameter 9λ, and so forth. The failure time is then the
first point in time t for which the set of edges X(t) does not support a path from A to F. Also, in
each iteration, in accordance with the CTMC theory that supports the Doob-Gillespie algorithm,
we can uniformly select an edge from X(t) to delete.
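The timing structure described above can be illustrated with a short sketch that uses only the Random and Statistics standard libraries. The function name expectedTimeToKth() and the sample size are assumptions for the illustration. The first failure among m edges is the minimum of m i.i.d. exponential lifetimes, so its mean should be close to 1/(mλ).

```julia
using Random, Statistics

# Expected time until the k-th edge failure when m edges have i.i.d.
# exponential lifetimes with rate lambda: successive waits have rates
# m*lambda, (m-1)*lambda, and so on.
expectedTimeToKth(m, k, lambda) = sum(1/((m - i)*lambda) for i in 0:k-1)

# Monte Carlo check for k = 1: the first failure is the minimum of m lifetimes.
rng = MersenneTwister(0)
m, lambda, N = 10, 0.5, 10^5
firstFailures = [minimum(randexp(rng, m)) / lambda for _ in 1:N]

mean(firstFailures)                # close to 1/(m*lambda) = 0.2
expectedTimeToKth(m, 1, lambda)    # approximately 0.2
```

Summing such terms up to the (random) number of failures needed to disconnect source from destination gives the network life time, which is what the Doob-Gillespie simulation accumulates step by step.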


The implementation of Doob-Gillespie for this network in Listing 10.15 uses the LightGraphs
package for handling the graph. This package encapsulates the representation, modification,
and analysis of graphs. The implementation uses an adjacency list to represent a graph which is an
alternative data structure to the adjacency matrix from (10.41). Here for each node, we keep a list
of nodes that are adjacent to it. Our code uses the adjacency list representation as apparent via
the use of the fadjlist field of Graph objects.


Listing 10.15: Dynamic network reliability


using LightGraphs, Distributions, StatsBase, Random, Plots, LaTeXStrings; pyplot()
Random.seed!(0)

function createNetwork(edges)
    network = Graph(maximum(maximum.(edges)))
    for e in edges
        add_edge!(network, e[1], e[2])
    end
    network
end

function uniformRandomEdge(network)
    outDegrees = length.(network.fadjlist)
    randI = sample(1:length(outDegrees), Weights(outDegrees))
    randJ = rand(network.fadjlist[randI])
    randI, randJ
end

function networkLife(network, source, dest, lambda)
    failureNetwork = copy(network)
    t = 0
    while has_path(failureNetwork, source, dest)
        t += rand(Exponential(1/(failureNetwork.ne*lambda)))
        i, j = uniformRandomEdge(failureNetwork)
        rem_edge!(failureNetwork, i, j)
    end
    t
end

lambda1, lambda2 = 0.5, 1.0
roads = [(1,2), (1,3), (2,4), (2,3), (2,5), (3,4), (3,5), (4,5), (4,6), (5,6)]
source, dest = 1, 6
network = createNetwork(roads)
N = 10^6

failTimes1 = [ networkLife(network,source,dest,lambda1) for _ in 1:N ]
failTimes2 = [ networkLife(network,source,dest,lambda2) for _ in 1:N ]

println("Edge Failure Rate = $(lambda1): Mean failure time = ",
        mean(failTimes1), " days.")
println("Edge Failure Rate = $(lambda2): Mean failure time = ",
        mean(failTimes2), " days.")

stephist(failTimes1, bins=200, c=:blue, normed=true, label=L"\lambda=0.5")
stephist!(failTimes2, bins=200, c=:red, normed=true, label=L"\lambda=1.0",
    xlims=(0,5), ylims=(0,1.1), xlabel="Network Life Time", ylabel = "Density")


Edge Failure Rate = 0.5: Mean failure time = 1.4471182849093784 days.
Edge Failure Rate = 1.0: Mean failure time = 0.48129663793885885 days.


In lines 4-10 we implement the createNetwork() function, which creates a Graph object from
the LightGraphs package based on a list of edges. In line 5, the Graph() constructor is called, for
which maximum(maximum.(edges)) defines the number of vertices. Then in lines 6-8 a for loop is
used to loop over each element of edges and add it to the graph via the add_edge!() function from
LightGraphs. The return value in line 9, network, is a graph object. In lines 12-17 we implement
uniformRandomEdge(), which takes a graph object from LightGraphs and returns a random
uniformly selected edge (in the form of a tuple). In line 13, outDegrees is set by broadcasting
length() to each element of network.fadjlist, i.e. to each element of the adjacency list. This
sets outDegrees as an array counting how many edges point out from each of the vertices. In line 14
we set randI to be an index of a vertex by sampling with weights based on outDegrees. Then
line 15 sets randJ. In line 16, the tuple (randI,randJ) is returned, which is guaranteed to be
uniformly selected from the edges due to this sampling strategy. In lines 19-28 we implement the
networkLife() function, which takes a network as input, and then degrades it according to a
Poisson process at rate lambda. At each state it checks if a connection exists between source and
destination, and returns the time when a path no longer exists. First, in line 20 the copy()
function is used to create a copy of network. This is because network is passed by reference and we
wish to degrade a copy of it, failureNetwork, and not the original network. Then in lines 22-26, the
LightGraphs function has_path() is used to see if the network has a path from source to dest.
Between each iteration, we wait for a duration that is exponentially distributed with a rate proportional
to the number of edges (failureNetwork.ne). Then in line 24 uniformRandomEdge() is used
to choose an edge, and in line 25 this is then removed via rem_edge!(). Two example λ values are
set in line 30 and in line 31 the network shown in Figure 10.14 is defined. It is made into a Graph
object in line 33. The simulations are executed in lines 36-37 by using the networkLife() function.
The remaining lines summarize the simulation data in the form of text output and histograms.
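The claim that the two-stage sampling strategy yields a uniformly selected edge can be checked with exact rational arithmetic. The sketch below does not use LightGraphs; it builds its own adjacency list, and the edge list shown is an assumption standing in for the 10-edge network of Figure 10.14. The name edgeProb() is hypothetical.

```julia
# Assumed edge list for a 10-edge, 6-vertex network.
roads = [(1,2), (1,3), (2,4), (2,3), (2,5), (3,4), (3,5), (4,5), (4,6), (5,6)]

# Build an adjacency list: for each vertex, the vertices adjacent to it.
nVertices = maximum(maximum.(roads))
adjList = [Int[] for _ in 1:nVertices]
for (i, j) in roads
    push!(adjList[i], j)
    push!(adjList[j], i)
end
degrees = length.(adjList)

# Probability that the two-stage scheme returns the undirected edge {i,j}:
# either i is drawn first (weighted by degree) and then j, or the other way round.
edgeProb(i, j) = (degrees[i]//sum(degrees))*(1//degrees[i]) +
                 (degrees[j]//sum(degrees))*(1//degrees[j])

probs = [edgeProb(i, j) for (i, j) in roads]
all(probs .== 1//length(roads))    # true
```

The degree weights cancel the 1/degree factor of the second stage, so every edge ends up with probability 2/(2m) = 1/m regardless of the graph.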











10.6 Common Random Numbers and Multiple RNGs 


More than half of the examples in this book involved some sort of (pseudo-)random number 
generation, often for the purpose of estimating some parameter, or performance measure. In such 
cases, one wishes to make the process as efficient as possible, i.e. one wishes to reduce the number 
of computations performed. However, there is an inherent tradeoff at play, since by reducing the 
number of computations one also reduces the confidence in the value of the parameter. Hence the 
concept of variance reduction is often employed to reduce the number of simulation runs, while 
maintaining the same precision of the parameter of interest. In this section we focus on one such 
technique called common random numbers. 


We have actually already used this technique in several examples; see for example Listing 7.3.
In these cases, the seed was fixed via Random.seed!() and a parameter was varied over some
desired range. This approach often resulted in much smoother curves for the phenomena of interest.


In order to gain more insight into this common random numbers approach, consider the random 
variable with a distribution parameterized by A, 


X ~ Uniform(0, 2A(1 — A)). (10.42) 


Clearly, 








EX[X] 2 A(1 — A). 
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— Expected curve 
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——— CRN estimate 





À 


Figure 10.16: Using common random numbers. 


Hence for this example, it is immediate that the expectation is maximized when $\lambda^* = 1/2$, which
yields $\mathbb{E}_{\lambda^*}[X] = 1/4$. Now, for illustrative purposes, say we are not able to find $\lambda^*$ analytically and
wish to find this optimal $\lambda$ using simulation.
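
For reference, the analytic calculation that the simulation is meant to imitate can be written out in full. It uses only the standard fact that a uniform distribution on $(0, b)$ has mean $b/2$; the second-derivative check is added here for completeness:

```latex
\mathbb{E}_\lambda[X] = \frac{0 + 2\lambda(1-\lambda)}{2} = \lambda(1-\lambda),
\qquad
\frac{d}{d\lambda}\,\lambda(1-\lambda) = 1 - 2\lambda = 0
\;\Longleftrightarrow\; \lambda^* = \tfrac{1}{2},
\qquad
\frac{d^2}{d\lambda^2}\,\lambda(1-\lambda) = -2 < 0,
\qquad
\mathbb{E}_{\lambda^*}[X] = \tfrac{1}{2}\cdot\tfrac{1}{2} = \tfrac{1}{4}.
```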














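To do so, we simulate $n$ copies of $X$ for each $\lambda$ in some grid over $(0, 1)$, and for each $\lambda$ obtain
an estimate via


$$\hat{m}(\lambda) := \widehat{\mathbb{E}}_\lambda[X] = \frac{1}{n}\sum_{i=1}^{n} X_i^{\lambda}, \tag{10.43}$$


where $X_i^{\lambda}$ is the $i$-th copy of the random variable with parameter $\lambda$. We then choose $\lambda^*$ as the $\lambda$
with maximal $\hat{m}(\lambda)$.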


Such a straightforward approach to simulation repeats the evaluation of $\hat{m}(\lambda)$, and uses different
independent random values each time. This would be the behavior if rand() was simply used
repeatedly, and the seed was not set between evaluations. Such an approach effectively implies
(assuming ideal random numbers) that for each $\lambda$, each evaluation of $\hat{m}(\lambda)$ is independent of the
other evaluations.


The method of common random numbers is to use the same random numbers, i.e. the same stream of
random numbers, for every $\lambda$ over the grid. Mathematically this can be viewed as fixing an $\omega_0$ in the
probability sample space $\Omega$ and re-evaluating the estimate $\hat{m}(\lambda, \omega_0)$ for all values
of $\lambda$. The idea is motivated by the assumption that for nearby parameter values, say $\lambda_0$ and $\lambda_1$, the
estimates $\hat{m}(\lambda_0, \omega_0)$ and $\hat{m}(\lambda_1, \omega_0)$ don't differ significantly. Hence by using the same $\omega_0$ for both
of these nearby values, a form of continuity in the estimated curve appears.
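
The idea of fixing the randomness once and re-evaluating the estimate for every parameter value can be sketched in a few lines. The sketch rests on an assumption that the text does not make: each X is generated by the inverse transform X = 2λ(1−λ)U with U uniform on [0,1), so that fixing the randomness amounts to fixing one vector of uniforms. The names us, mHat and lamGrid are hypothetical.

```julia
using Random, Statistics

n = 100
lamGrid = 0.01:0.01:0.99

# Fix the randomness once: a single vector of uniforms, reused for every lambda.
us = rand(MersenneTwister(1), n)

# CRN estimate of the mean for a given lambda, using the shared uniforms.
# Assumption: X = 2*lam*(1-lam)*U is the inverse transform for Uniform(0, 2*lam*(1-lam)).
mHat(lam) = mean(2 * lam * (1 - lam) .* us)

estCRN = mHat.(lamGrid)
lamStar = lamGrid[argmax(estCRN)]
println("Estimated maximizer: ", lamStar)
```

Under this assumption the whole estimated curve equals the true curve λ(1−λ) multiplied by the single random factor 2·mean(us). It is therefore as smooth as the true curve and its maximizer on the grid is the grid point at 1/2, even though the height of the curve is still random.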


In Listing 10.16 we consider the example of estimating the maximizer $\lambda^*$ from (10.43), and compare
estimates obtained naively using different random numbers each time with estimates obtained
via the use of common random numbers. The results shown in Figure 10.16 illustrate that for
estimates obtained using common random numbers, the neighboring estimates do not differ greatly
(much less variance is observed), and the estimates are much closer to the true parameter values.


Listing 10.16: Variance reduction via common random numbers


    using Distributions, Random, Plots, LaTeXStrings; pyplot()

    seed = 1                    # seed shared by all CRN evaluations
    N = 100                     # sample size for each value of lambda
    lamGrid = 0.01:0.01:0.99    # grid of lambda values

    theorM(lam) = mean(Uniform(0,2*lam*(1-lam)))          # theoretical mean
    estM(lam) = mean(rand(Uniform(0,2*lam*(1-lam)),N))    # sample mean, global RNG

    function estM(lam,seed)
        Random.seed!(seed)      # reset the stream before every estimate
        estM(lam)
    end

    trueM = theorM.(lamGrid)
    estM0 = estM.(lamGrid)
    estMCRN = estM.(lamGrid,seed)

    plot(lamGrid,trueM,
        c=:black, label="Expected curve")
    plot!(lamGrid,estM0,
        c=:blue, label="No CRN estimate")
    plot!(lamGrid,estMCRN,
        c=:red, label="CRN estimate",
        xlims=(0,1), ylims=(0,0.4), xlabel=L"\lambda", ylabel = "Mean")




















In line 7 we define the function theorM() which returns the theoretical mean, $\lambda(1-\lambda)$, by using the
mean() method for a uniform random variable from Distributions.jl. In line 8 the function
estM() is defined, which creates a sample of N random variables and computes their sample mean.
In lines 10-13 we define an additional method for estM(). This method takes two arguments, the
second one being seed. It sets the random seed in line 11 and then estimates the sample mean via
the function of line 8. In line 15 the theoretical means are evaluated over the grid lamGrid, and the
vector is set as trueM. In line 16 estM() is used to estimate the means over lamGrid without the use
of common random numbers. In line 17 the second method of estM() is used to estimate the means
over lamGrid through the use of common random numbers. This way the same stream of random
numbers is used in each estimate. The remainder of the code is used to create Figure 10.16.
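
The visual difference between the two estimated curves can also be summarized by a single number. The following sketch is not part of the book's listing: it uses only the standard library, assumes that the uniforms are generated by the inverse transform X = 2λ(1−λ)U, and introduces a hypothetical roughness() measure (the total absolute change between neighboring grid points).

```julia
using Random, Statistics

N = 100
lamGrid = 0.01:0.01:0.99

# Sample mean of N draws of Uniform(0, 2*lam*(1-lam)), drawn from `rng`.
# Assumption: inverse transform, X = 2*lam*(1-lam)*U with U uniform on [0,1).
estM(lam, rng) = mean(2 * lam * (1 - lam) .* rand(rng, N))

# No CRN: a single RNG keeps advancing, so every lambda sees fresh random numbers.
rngNaive = MersenneTwister(1)
estNaive = [estM(lam, rngNaive) for lam in lamGrid]

# CRN: a freshly seeded RNG for every lambda, so every lambda sees the same stream.
estCRN = [estM(lam, MersenneTwister(1)) for lam in lamGrid]

# Hypothetical roughness measure: total absolute change between neighboring estimates.
roughness(v) = sum(abs.(diff(v)))

println("Roughness without CRN: ", roughness(estNaive))
println("Roughness with CRN:    ", roughness(estCRN))
```

A smaller roughness value corresponds to the smoother curve; in this setup the CRN curve only rises once and falls once, while the curve without CRN fluctuates from one grid point to the next.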








The Case for Using Multiple RNGs 


We now consider another example, with the purpose of showing that in addition to the benefit
of using common random numbers, there may sometimes be benefit from using multiple random
number generators (RNGs) instead of a single RNG. Such practice is often employed in complex
simulations; here, however, we illustrate it with a simple example.


Figure 10.17: The effect of using two RNGs together with common random
numbers: The blue curve is obtained with two RNGs and shows better
performance than the red curve (no common random numbers) and the green
curve (single RNG with common random numbers). (Legend: No CRN; CRN and one RNG; CRN and two RNGs. Axes: $\lambda$ and Mean.)


Extend the previous example by considering a random sum. For this consider the random
variable,


$$X = \sum_{i=1}^{N} Z_i, \tag{10.44}$$


where $N \sim \mathrm{Poisson}(K\lambda)$ and $Z_i \sim \mathrm{Uniform}(0, 2(1-\lambda))$ with $\lambda \in (0,1)$ and $K > 0$. In this case,
it is possible to show that


$$\mathbb{E}_\lambda[X] = K\lambda(1-\lambda).$$
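
One way to verify this expectation is to condition on $N$. The calculation below assumes that $N$ is independent of the i.i.d. sequence $Z_1, Z_2, \ldots$; the text does not state this explicitly, although the construction suggests it:

```latex
\mathbb{E}_\lambda[X]
  = \mathbb{E}\Big[\,\mathbb{E}\Big[\sum_{i=1}^{N} Z_i \,\Big|\, N\Big]\Big]
  = \mathbb{E}[N]\,\mathbb{E}[Z_1]
  = K\lambda \cdot \frac{0 + 2(1-\lambda)}{2}
  = K\lambda(1-\lambda).
```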





Like the previous example, here it is easy to see that the expectation is maximized when $\lambda^* = 1/2$,
in which case $\mathbb{E}_{\lambda^*}[X] = K/4$. However again, say that for illustration purposes, we wish to
find this optimal $\lambda$ using simulation. In this case we may again simulate $n$ copies of $X$ for each $\lambda$ in
some grid on $(0, 1)$ and for each $\lambda$ obtain an estimate just like before via (10.43). We then choose
$\lambda^*$ as the $\lambda$ with maximal $\hat{m}(\lambda)$.


Like the previous example, it may be of interest to employ common random numbers. However, in
this case, as we demonstrate, unless we maintain different random number streams for $N$ and $\{Z_i\}$,
the effect of common random numbers turns out to be almost insignificant. In this specific example
this is due to the fact that the number of random numbers used to generate each $X$ from (10.44)
varies for each sample. Further, if $\lambda$ is modified and as a consequence samples of $N$ are modified,
the original random numbers that were used for a given $Z_i$ are shifted. This can effectively break
the desired effect of common random numbers. The phenomenon is illustrated in Figure 10.17.
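
The shifting of random numbers can be observed directly. The sketch below is an illustration only and is not taken from the book's listing: poissonQuantile() is a hypothetical hand-written inverse transform for the Poisson distribution, and the two count functions record just the sampled values of N for a given λ, with either one shared stream or a dedicated stream for N.

```julia
using Random

K = 50

# Hypothetical helper: Poisson(mu) quantile at level u via the inverse probability
# transform, i.e. the smallest k with P(N <= k) >= u. The cap on k is only a safeguard.
function poissonQuantile(mu, u)
    k, p = 0, exp(-mu)
    cdf = p
    while cdf < u && k < 10_000
        k += 1
        p *= mu / k
        cdf += p
    end
    return k
end

# One stream: N and the Z_i share `rng`, so each sample consumes 1 + N random numbers.
function countsOneStream(lam, n)
    rng = MersenneTwister(1)
    counts = Int[]
    for _ in 1:n
        N = poissonQuantile(K * lam, rand(rng))
        rand(rng, N)            # the uniforms that would be used for Z_1, ..., Z_N
        push!(counts, N)
    end
    return counts
end

# Two streams: N always consumes exactly one number per sample from its own RNG.
function countsTwoStreams(lam, n)
    rngN = MersenneTwister(1)
    return [poissonQuantile(K * lam, rand(rngN)) for _ in 1:n]
end

println("One stream:  ", countsOneStream(0.50, 8), " vs ", countsOneStream(0.51, 8))
println("Two streams: ", countsTwoStreams(0.50, 8), " vs ", countsTwoStreams(0.51, 8))
```

With two streams, the counts for the two nearby values of λ are driven by the same uniforms, so the second vector is elementwise at least as large as the first. With one stream only the first count is guaranteed to be comparable; after that, the position in the stream depends on the earlier values of N.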


In Listing 10.17 we create Figure 10.17, where we also create our own Poisson random number
generation function, prn(), in line 6. It turns out that visually observing the benefit of multiple
RNGs in this example works with our naive (quantile based) random number generator, but does
not work with an arbitrary Poisson random number generator. The listing also repeats the generation M
times, to obtain estimates of the standard deviation of $\lambda^*$. This output serves as further evidence that
there is some benefit in using multiple random number generators.


Listing 10.17: A case for two RNGs


    using Distributions, Random, Plots, LaTeXStrings; pyplot()

    N, K, M = 10^2, 50, 10^3
    lamRange = 0.01:0.01:0.99

    prn(lambda,rng) = quantile(Poisson(lambda),rand(rng))    # inverse transform
    zDist(lam) = Uniform(0,2*(1-lam))

    rv(lam,rng) = sum([rand(rng,zDist(lam)) for _ in 1:prn(K*lam,rng)])
    rv2(lam,rng1,rng2) = sum([rand(rng1,zDist(lam)) for _ in 1:prn(K*lam,rng2)])

    mEst(lam,rng) = mean([rv(lam,rng) for _ in 1:N])
    mEst2(lam,rng1,rng2) = mean([rv2(lam,rng1,rng2) for _ in 1:N])

    function mGraph0(seed)
        singleRng = MersenneTwister(seed)
        [mEst(lam,singleRng) for lam in lamRange]
    end
    mGraph1(seed) = [mEst(lam,MersenneTwister(seed)) for lam in lamRange]
    mGraph2(seed1,seed2) = [mEst2(lam,MersenneTwister(seed1),
                            MersenneTwister(seed2)) for lam in lamRange]

    argMaxLam(graph) = lamRange[findmax(graph)[2]]

    std0 = std([argMaxLam(mGraph0(seed)) for seed in 1:M])
    std1 = std([argMaxLam(mGraph1(seed)) for seed in 1:M])
    std2 = std([argMaxLam(mGraph2(seed,seed+M)) for seed in 1:M])

    println("Standard deviation with no CRN: ", std0)
    println("Standard deviation with CRN and single RNG: ", std1)
    println("Standard deviation with CRN and two RNGs: ", std2)

    plot(lamRange,mGraph0(1987),
        c=:red, label="No CRN")
    plot!(lamRange,mGraph1(1987),
        c=:green, label="CRN and one RNG")
    plot!(lamRange,mGraph2(1987,1988),
        c=:blue, label="CRN and two RNG's", xlims=(0,1),ylims=(0,14),
        xlabel=L"\lambda", ylabel = "Mean")





    Standard deviation with no CRN: 0.037080520020152975
    Standard deviation with CRN and single RNG: 0.03411444555309958
    Standard deviation with CRN and two RNGs: 0.014645353747396726


In line 3 we define N, the number of repetitions to carry out for each value of $\lambda$; the constant K; and
the number of repetitions to carry out in total for estimating the argmax, M. In line 6 we define our
function prn(). It uses the inverse probability transform to generate a Poisson random variable with
parameter lambda and with a random number generator rng. In line 7 we define a function for
creating a Uniform(0, 2(1 − $\lambda$)) distribution. In lines 9-10 we create the two central functions for this
example. The function rv() uses a single random number generator to generate the random variable
(10.44). Then the function rv2() achieves this with two random number generators: one for the uniform
random variables and one for the Poisson random variable. Lines 12-13 create the functions mEst()
and mEst2(). The first uses a single random number generator and the second uses two random
number generators. Lines 15-21 define the functions mGraph0(), mGraph1() and mGraph2() for
obtaining trajectories of the estimate for each $\lambda$ in lamRange. The mGraph0() function uses a single
RNG and no common random numbers, the function mGraph1() uses common random numbers reset
each time on the same seed, and the mGraph2() function uses two RNGs. In line 23 argMaxLam()
picks the value of $\lambda$ at which the estimate is maximal. The standard deviations are estimated in lines 25-27. The remainder of the code
prints the output and creates Figure 10.17.
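
Since argMaxLam() relies on the fact that findmax() returns both the maximal value and its index, a tiny illustration may help. The grid and the estimates below are made-up toy values, used only to show the return value of findmax():

```julia
# Hypothetical toy grid and estimates, only to illustrate findmax().
grid = 0.1:0.1:0.5
est = [0.09, 0.16, 0.21, 0.24, 0.23]

maxValue, maxIndex = findmax(est)   # (maximal value, index of the maximal value)
println(maxValue)                   # 0.24
println(maxIndex)                   # 4
println(grid[maxIndex])             # the grid point at which the maximum is attained
```

Indexing the result with [2], as in the listing, keeps only the index. The function argmax() from Julia Base returns that index directly.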











Appendix A 


How-to in Julia - DRAFT 


The code examples in this book are primarily designed to illustrate statistical concepts. However,
they also have a secondary purpose: they serve as a way of learning how to use Julia by example.
Towards this end, the appendix links language features with specific code listings in the book. This
appendix can be used on an ad-hoc basis to find code examples where you can see “how to” do
specific things in Julia. Once you find the specific “how to” that you are looking for, you can refer
to its associated code example, referenced via “⇒”. This appendix is also available at:








The appendix is broken up into several subsections as follows. Basics (Section A.1) deals with
basic language features. Text and I/O (Section A.2) deals with textual operations as well as input
and output. Data Structures (Section A.3) deals with data structures and their use. This includes
basic arrays as well as other structures. Data Frames, Time-Series, and Dates (Section A.4)
deals with Data Frames and related objects for organizing heterogeneous data. Mathematics
(Section A.5) covers various mathematical aspects of the language. Randomness, Statistics, and
Machine Learning (Section A.6) deals with random number generation, elementary statistics,
distributions, statistical inference, and machine learning. Graphics (Section A.7) deals with plotting,
manipulation of figures, and animation.


A.1 Basics


Types 





Check the type of an object. 

=> Listing 

Specify the type of an argument to a function. 
> Listing 


Specify the type of an array when initialized using zeros().


⇒ Listing


Convert the type of a variable with convert().


=> Listing 


Convert the type of a variable with a constructor like Int (). 

=> Listing 

Use a 32 bit float instead of the default 64 bit float with the f0 float literal.
— Listing 


Use big representation of numbers using big (). 


> Listing 


Check if a variable is immutable with isimmutable(). 


=> Listing 


Variables 
































Modify a global variable inside a different scope by declaring global. 
— Listing [1.5 


Assign two values in a single statement (using an implicit tuple). 


> Listing 


Copy a variable, array, or struct with copy (). 


— Listing 


Copy a variable, array, or struct with deepcopy (). 


> Listing 


Conditionals and Logical Operations 


















































Use the conditional if statement. 


=> Listing 


Use the conditional else statement. 


=> Listing 


Use the conditional elseif statement. 


=> Listing 


Use the shorthand conditional (ternary) operator ? :.


=> Listing 


Carry out element-wise and using .&.


> Listing 


Carry out element-wise negation using .!. 


> Listing 


Use logical or ||.


=> Listing 



















Use short circuit evaluation with logical and &&. 


=> Listing 





Loops 








Create a while loop. 
— Listing [1.11 





Loop over values in an array. 


> Listing 


Create nested for loops. 


— Listing 


Break out of a loop with break. 
> Listing 


Execute the next loop iteration from the top with continue. 
— Listing 2.5 
































Loop over an enumeration of (index, value) pairs created by enumerate().


— Listing 





Functions 





Create a function. 


— Listing 


Create a one line function. 


— Listing 


Create a function using begin and end.


— Listing 


Create a function that returns a function. 


— Listing 


Pass functions as arguments to functions. 


=> Listing [10.10 


Create a function with a multiple number of arguments. 


— Listing 


Use an anonymous function. 


— Listing 


Define a function inside another function. 


— Listing 


Create a function that returns a tuple.


⇒ Listing


Set up default values for function arguments.


=> Listing [10.10 


Other Basic Operations 






























































Check the running time of a block of code. 
— Listing 


Increment values using +=. 


> Listing 


Do element-wise comparisons such as for example using .>. 


— Listing 


Apply an element-wise computation to a tuple. 


> Listing 


Use the logical xor () function. 
— Listing 2.12 





Set a numerical value to be infinity with Inf. 


— Listing 


Include another block of Julia code using include(). 


— Listing 


Find the maximal value amongst several arguments using max (). 


— Listing 


Find the minimal value amongst several arguments using min(). 
=> Listing |5.20 


Metaprogramming 














Define a macro. 


> Listing 


Interacting with Other Languages 
































Copy data to the R environment with @rput from package RCall. 
— Listing 


Get data from the R environment with @rget from package RCall.


— Listing 


Execute an R language block with the command R from package RCall. 


— Listing 


Set up a Python object in Julia using @pyimport from package PyCall.
⇒ Listing


A.2 Text and I/O


Strings 








Split a string based on whitespace with split (). 
— Listing 


Use LaTeX formatting for strings. 
=> Listing 


See if a string is a substring of another string with occursin (). 


> Listing 


Concatenate two strings using *.


> Listing 


























Text Output 








Print text output including new lines, and tabs. 


— Listing 


Format variables within strings when printing. 


=> Listing 


Display an expression to output using display (). 


— Listing 


Display an expression to output using show(stdout,...). 


— Listing 


Present the value of an expression with @show.


— Listing 


Display an information line with @info. 


— Listing 


Redirect the standard output to a file. 
— Listing 


















































Reading and Writing From Files 





Open a file for writing with open (). 
— Listing 


Open a file for reading with open (). 
— Listing 


Write a string to a file with write().


⇒ Listing


Close a file after it was opened.


=> Listing 


Read from a file with read (). 
— Listing [4.12 














Find out the current working directory with pwd (). 
> Listing 


See the list of files in a directory with readdir(). 
> Listing 

See the directory of the current file with @__DIR__ 
> Listing 


Change the current directory with cd (). 
> Listing 





























CSV Files 





Read a CSV file to create a dataframe with a header. 
> Listing 


Read a CSV file to create a dataframe without a header.
> Listing 


Write to a CSV file with CSV.write(). 
> Listing 























JSON 














Parse a JSON file with JSON.parse (). 
=> Listing [1.9] 


BSON 





Write to a BSON file. 
> Listing 


Read from a BSON file. 
> Listing 

















HTTP Input 





Create an HTTP request. 
> Listing 


Convert binary data to a string.


⇒ Listing


A.3 Data Structures


Creating Arrays 








Create a range of numbers. 


> Listing 


Create an array of zero values with zeros (). 


— Listing 


Create an array of one values with ones (). 
— Listing |2.4 

















Create an array with a repeated value using fill (). 


=> Listing 


Create an array of strings. 


=> Listing 


Create an array of numerical values based on a formula. 


> Listing 


Create an empty array of a given type. 


=> Listing 


Create an array of character ranges. 


=> Listing 


Create an array of tuples. 


=> Listing 


Create an array of arrays. 
> Listing [1.15 















































Basic Array Operations 





Discover the length () of an array. 


> Listing 


Access elements of an array. 


— Listing 


Obtain the first and last elements of an array using first () and last (). 


> Listing 


Apply a function like sqrt () onto an array of numbers. 


> Listing 


Map a function onto an array with map().


⇒ Listing


Append to an array with push!().
— Listing 


Convert an object into an array with the collect () function. 


— Listing 


Preallocate an array of a given size. 


— Listing 


Delete an element from an array or collection with deleteat! (). 


— Listing 


Find the first element of an array matching a pattern with findfirst(). 


— Listing 


Append an array to an existing array with append! (). 


— Listing 


Sum up two equally sized arrays element by element.


— Listing 


Stick together several arrays into one array using vcat () and .... 


— Listing 


Further Array Accessories 






























































Sum up values of an array with sum(). 


— Listing 


Search for a maximal index in an array using findmax (). 


— Listing 


Count the number of occurrence repetitions with the count () function. 


— Listing 


Sort an array using the sort () function. 


— Listing 


Filter an array based on a criterion using the filter () function. 


— Listing 


Find the maximal value in an array using maximum(). 


— Listing 


Count the number of occurrence repetitions with the counts () function from StatsBase. 


— Listing 


Reduce a collection to unique elements with unique (). 


— Listing 


Check if an array is empty with isempty().
⇒ Listing


Find the minimal value in an array using minimum().


— Listing 


Accumulate values of an array with accumulate(). 


— Listing 


Sort an array in place using the sort! () function. 


— Listing 























Sets 





Check if an element is an element of a set with in(). 


— Listing 


Check if a set is a subset of a set with issubset(). 


— Listing 


Obtain the set difference of two sets with setdiff(). 
— Listing 2.5 




















Create a set from a range of numbers. 


— Listing 


Obtain the union of two sets with union(). 


— Listing 


Obtain the intersection of two sets with intersect(). 


— Listing 























Matrices 





Obtain the dimensions of a matrix using size(). 


— Listing 


Define a matrix based on a set of values. 


— Listing 


Define a matrix based on side by side columns. 


— Listing 


Raise a matrix to a power. 


— Listing 


Access a given row of a matrix. 


— Listing 


Stick together two matrices using vcat (). 


— Listing 


Take a matrix and/or vector transpose.


⇒ Listing


Modify the dimensions of a matrix with reshape().


=> Listing 


Use an identity matrix with I.
— Listing [1.8 














Setup a diagonal matrix with diagm() and a dictionary. 


> Listing 


Obtain the diagonal of a matrix with diag () . 
> Listing 


Create a matrix by sticking together column vectors. 


> Listing 




















Tensors 











Work with a tensor. 


> Listing 





Dictionaries 








Access elements of a dictionary. 


> Listing 


Create a dictionary. 


> Listing 














Graphs 








Create Graph objects from the package LightGraphs. 
— Listing [10.11 


Add edges to Graph objects using add_edge! (). 
> Listing 


Remove edges from Graph objects using rem_edge! (). 


> Listing [10.11 




















Other Data Structures 








Setup a Queue data structure from package DataStructures. 


— Listing [10.11 


Insert an element to a Queue data structure using enqueue! (). 


— Listing [10.11 


Remove an element from a Queue data structure using dequeue!().


⇒ Listing 10.11


A.4 Data Frames, Time-Series, and Dates


Dataframe Basics 





Select certain rows of a DataFrame. 


=> Listing 


Select certain columns of a DataFrame. 


=> Listing 


Filter the rows of a DataFrame using a Boolean array.


— Listing 


See if data all rows of a DataFrame that using a boolean array. 


— Listing 


Check for missing values using ismissing().


— Listing 


Remove missing values using dropmissing(), removing any rows with missing values. 


— Listing 


Remove missing values using skipmissing() removing specific missing values. 


— Listing 


Sort a data frame based on a given column. 


— Listing 
























































R Data Sets 











Obtain a data frame from RDataSets with dataset(). 


— Listing 





Time Series 








Create a time-series (TimeArray) object. 
=> Listing [8.20 





Perform a moving average on a time-series. 


> Listing 


Compute the autocorrelation function estimate of a time-series. 


> Listing 

















Dates 














Parse dates using Dates.DateFormat.


⇒ Listing


Obtain the day of week from a date object.
=> Listing 


Obtain the month from a date object. 
=> Listing 


Obtain the year from a date object. 
> Listing 


Create a Date object based on day, month, and year. 


=> Listing 


A.5 Mathematics


Basic Math 
























































Compute the modulo (remainder) of integer division. 


=> Listing 


Check if a number is even with iseven (). 


=> Listing 


Take the product of elements of an array using prod (). 


> Listing 


Round numbers to a desired accuracy with round (). 


— Listing 


Compute the floor of value using floor (). 


— Listing 


Take the product of elements of an array using * with the splat operator ... as “product”.


— Listing 


Represent π using the constant pi.


— Listing 


Represent Euler's e using the constant MathConstants.e. 


— Listing 





Math Functions 





























Compute permutations using the factorial() function. 


— Listing 


Compute the absolute value with abs (). 


— Listing 


Compute the sign function with sign().


⇒ Listing


Create all the permutations of a set with permutations() from Combinatorics.


> Listing 

Calculate binomial coefficients with binomial (). 
> Listing 

Use mathematical special functions such as zeta (). 
> Listing 

Calculate the exponential function with exp (). 
— Listing 

Calculate the logarithm function with log (). 

=> Listing 

Calculate trigonometric functions like cos (). 

=> Listing 


Create all the combinations of a set with combinations() from Combinatorics.


> Listing 


Linear Algebra 











Solve a system of equations using the backslash operator. 
> Listing 


Use LinearAlgebra functions such as eigvecs (). 

































































> Listing 

Carry out a Cholesky decomposition of a matrix. 

> Listing 

Calculate the inner product of a vector by multiplying the transpose by the vector. 
> Listing 

Calculate the inner product by using dot (). 

> Listing 

Compute a matrix exponential with exp (). 

> Listing 

Compute the inverse of a matrix with inv (). 

> Listing 

Compute the Moore-Penrose pseudo-inverse of a matrix with pinv (). 
> Listing 

Compute the Lp norm of a function with norm(). 

=> Listing [8.2 

Compute the QR-factorization of a matrix with qr (). 

> Listing 


Compute the SVD-factorization of a matrix with svd().


⇒ Listing


Numerical Math











Find all roots of a mathematical function using find_zeros().


— Listing 


Find a root of a mathematical function using find_zero().


— Listing 


Carry out numerical integration using package QuadGK. 


— Listing 


Carry out numerical differentiation using package Calculus. 


— Listing 


Carry out numerical integration using package HCubature. 


— Listing 


Solve a system of equations numerically with nlsolve() from package NLsolve.
=> Listing |5.8 





















































Numerically solve a differential equation using the DifferentialEquations package.


=> Listing 


A.6 Randomness, Statistics, and Machine Learning 


Randomness 





Sample a random number using a prescribed weighting with sample (). 


=> Listing 


Get a uniform random number in the range [0, 1).


> Listing 


Set the seed of the random number generator. 
— Listing [1.14 




















Create a random permutation using shuffle! (). 


=> Listing 


Generate a random number from a given range with rand (). 
— Listing [2.9 














Generate an array of random uniforms with rand (). 


> Listing 


Generate a random element from a set of values with rand().


> Listing 


Generate an array of standard normal random variables with randn().


⇒ Listing


Generate an array of pseudorandom values from a given distribution.


— Listing 


Generate multivariate normal random values via MvNormal (). 


=> Listing 


Distributions 














































































































Create a distribution object from the Distributions package.


> Listing 


Evaluate the PDF (density) of a given distribution. 
=> Listing 


Evaluate the CDF (cumulative probability) of a given distribution. 
=> Listing 


Evaluate the CCDF (one minus cumulative probability) of a given distribution. 


> Listing 


Evaluate quantiles of a given distribution. 


=> Listing 


Obtain the parameters of a given distribution. 


> Listing 


Evaluate the mean of a given distribution. 


> Listing 


Evaluate the median of a given distribution. 


> Listing 


Evaluate the variance of a given distribution. 


> Listing 


Evaluate the standard deviation of a given distribution. 


> Listing 


Evaluate the skewness of a given distribution. 


> Listing 


Evaluate the kurtosis of a given distribution. 


> Listing 


Evaluate the range of support of a given distribution. 
— Listing [3.10 


Evaluate the mode (or modes) of a given distribution.


⇒ Listing


Basic Statistics





Calculate the arithmetic mean of an array. 


=> Listing 


Calculate the geometric mean of an array. 


> Listing 


Calculate the harmonic mean of an array. 


=> Listing 


Calculate a quantile. 


> Listing 


Calculate the sample variance of an array. 


> Listing 
Calculate the sample standard deviation of an array. 


> Listing 


Calculate the median of an array. 
— Listing [4.11 












































Calculate the sample covariance from two arrays. 


> Listing 


Calculate the sample correlation from two arrays. 


— Listing 


Calculate the sample covariance matrix from a collection of arrays in a matrix. 


> Listing 























Statistical Inference 











Use the confint() function on a hypothesis test.


> Listing 


Carry out a one sample Z test using the HypothesisTests package. 


> Listing 
Carry out a one sample T test using the HypothesisTests package. 


> Listing 


Carry out a two sample, equal variance, T test using the HypothesisTests package. 


— Listing 


Carry out a two sample, non-equal variance, T test using the HypothesisTests package. 


=> Listing 


Carry out kernel density estimation using kde() from package KernelDensity.


> Listing 


Create an Empirical Cumulative Distribution Function using ecdf().


⇒ Listing


Linear Models and Generalizations














Create a formula for a (generalized) linear model with @formula. 


— Listing 


Fit a linear model with fit(), lm(), or glm().
> Listing 


Calculate the deviance of a linear model with deviance (). 


=> Listing 


Get the standard error of a linear model with stderror().


— Listing 


Get the R² value of a linear model with r2().


=> Listing 


Get the fit coefficients of a (generalized) linear model with coef () . 


=> Listing 


Fit a logistic regression model using package GLM. 


> Listing 


Fit GLMs with different link functions using package GLM.
> Listing 


Fit a ridge regression model. 


=> Listing 


Fit a LASSO model. 
— Listing [8.17 






























































Supervised Classification 








Fit a Support Vector Machine (SVM) model. 
— Listing 


Fit a random forest model. 


> Listing 














Unsupervised Learning 





Carry out k-means clustering. 
— Listing [9.12 








Carry out hierarchical clustering. 


> Listing 


Carry out principal component analysis.


⇒ Listing


Deep Learning using Flux.jl








Use Chain() to construct a deep learning model. 


=> Listing 


Use loadparams! () to fill the parameters of a model. 


=> Listing 


Use onecold() to retrieve a label. 


=> Listing 


Use onehotbatch () to create a one hot vector from a label. 


=> Listing 


Train a model with train! (). 


> Listing 


Set callback functions to be called during training. 


=> Listing 

Calculate a gradient using gradient (). 
— Listing 

Optimize using ADAM. 

> Listing 


Use the softmax () function. 


> Listing 


Use the crossentropy () function. 


=> Listing 


Use dropout. 
=> Listing 


Use the @epochs macro in training. 


=> Listing 


Use batch normalization. 


— Listing 






















































































A.7 Graphics 


The book contains 126 figures that are generated by the source code and presented in the book
‘as-is’. These can also be viewed on one page in this online gallery:


https://statisticswithjulia.org/gallery.html 





You can use this online gallery to then find an image that contains the type of features that you 
want to include in your figures and refer to the online source code. 


Appendix B 


Additional Julia Features - DRAFT 


The code examples in the book use a variety of Julia language features. However, these examples
are purposefully short and do not exploit the full power of Julia as a general programming language.
To fully explore Julia, you should be aware of other features not exploited in our code examples.
Below is a list of key additional language features. For full documentation of each of these features,
consult the official Julia documentation: https://docs.julialang.org


Constant values: Global variables can be defined as constants with the const keyword. 
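
As a minimal sketch of the const keyword (the names GRAVITY and fallTime are hypothetical and do not appear in the book's listings), a global constant can be declared once and then used inside functions:

```julia
# Declare a global constant. Its type is fixed, which helps the compiler
# generate efficient code for functions that read it.
const GRAVITY = 9.81

# A function that uses the global constant: time (in seconds) for an object
# to fall from a given height (in meters), ignoring air resistance.
fallTime(height) = sqrt(2 * height / GRAVITY)

println(fallTime(10.0))
```

Non-constant global variables can change type at any time, so declaring globals as const is one common way to avoid the associated performance penalty.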


Creation of packages: The nature of our code examples is illustrative, allowing them to run on a 
standard environment without requiring any special installation. However, once you create 
code that you wish to reuse, you may want to encapsulate it in a Julia package. This is done 
via the generate command in the package manager. 


Documentation: It is easy to document a function by placing a markdown docstring above the 
function definition. 
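
The following sketch shows one way such a docstring might look. The function squareSum is a hypothetical example, not one of the book's listings.

```julia
"""
    squareSum(x, y)

Return the sum of the squares of `x` and `y`.

# Examples
    squareSum(3, 4)   # 25
"""
squareSum(x, y) = x^2 + y^2

println(squareSum(3, 4))
```

Once the function is defined this way, typing ?squareSum in the Julia REPL displays the rendered docstring.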


Environments: You may create different working environments where each environment has its
own set of packages and versions installed. The key files that define an environment are
Project.toml and Manifest.toml.


Exception handling: Julia has built-in exception handling support. A key mechanism is the
try and catch construct, together with the throw() function, which allows functions to raise exceptions.
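
A minimal sketch of this mechanism follows. The helper functions safeSqrt and describeSqrt are hypothetical names introduced only for illustration.

```julia
# Hypothetical helper that throws a DomainError for negative input.
function safeSqrt(x)
    x < 0 && throw(DomainError(x, "input must be non-negative"))
    return sqrt(x)
end

# Call safeSqrt and handle the exception instead of letting it propagate.
function describeSqrt(x)
    try
        return "sqrt = $(safeSqrt(x))"
    catch err
        # Only handle the exception type we expect; rethrow anything else.
        err isa DomainError || rethrow()
        return "caught a DomainError"
    end
end

println(describeSqrt(4.0))
println(describeSqrt(-1.0))
```

Checking the exception type inside the catch block, as done here, avoids silently swallowing unrelated errors.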


GPU support: There are several mechanisms for integrating GPU support, including the packages of
the JuliaGPU organization such as CUDAnative.jl (see Appendix C).


Interfaces: Much of Julia’s power and extensibility comes from a collection of informal interfaces. 
By extending a few specific methods to work for a custom type, objects of that type not 
only receive those functionalities, but they are also able to be used in other methods that 
are written to generically build upon those behaviors. Iterable objects are particularly 
useful, and we have used them in several of our examples. In addition, there are methods for 
indexing, interfacing with abstract arrays and strided arrays, as well as ways of customizing 
broadcasting. 
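
To illustrate the iteration interface, here is a minimal sketch with a hypothetical type Squares (not taken from the book's listings). Extending Base.iterate is enough for objects of the type to work in for loops and with generic functions such as collect() and sum().

```julia
# Hypothetical type representing the first n square numbers.
struct Squares
    n::Int
end

# The iteration interface: return (item, nextState), or nothing when done.
Base.iterate(s::Squares, state=1) = state > s.n ? nothing : (state^2, state + 1)

# Optional methods that let generic code preallocate typed results.
Base.length(s::Squares) = s.n
Base.eltype(::Type{Squares}) = Int

for v in Squares(3)
    println(v)
end
println(collect(Squares(4)))
println(sum(Squares(4)))
```

None of collect(), sum(), or the for loop were written with Squares in mind. They work because they are built generically on top of iterate().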


Low level TCP/IP sockets: Julia supports TCP and UDP sockets via the Sockets standard library,
which is installed as part of Julia. The methods will be familiar to those who have
used the Unix socket API. For example, server = listen(ip"127.0.0.1", 2000)
will create a localhost socket listening on port 2000, connect(ip"127.0.0.1", 2000)
will connect to the socket, and close(server) will close the listening socket.


Metaprogramming: Julia supports ‘Lisp like’ metaprogramming, which makes it possible to create
a program that generates some of its own code, and to create true Lisp-style macros which
operate at the level of abstract syntax trees. As a brief example, x = Meta.parse("1 + 2")
parses the argument string into an expression type object and stores it as x. This object
can be inspected via dump(x) (note the + symbol, represented by :+). The expression can
also be evaluated via eval(x), which returns the numerical result of 3.
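
The example just described can be run as the following short sketch, which also builds an equivalent expression by quoting rather than parsing a string:

```julia
# Parse a string into an expression object (of type Expr).
x = Meta.parse("1 + 2")

# Inspect the structure of the expression: a call to :+ with arguments 1 and 2.
dump(x)
println(x.head)
println(x.args)

# Evaluate the expression.
println(eval(x))

# The same expression can be constructed directly with the quote syntax.
y = :(1 + 2)
println(x == y)
```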


Multiple dispatch: While some of our examples use multiple dispatch, we have not unleashed its 
full power. The ability to create different methods for the same function sits at the heart of 
the Julia programming paradigm and allows users to extend packages in a very productive 
manner. 
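
As a minimal sketch of multiple dispatch (the types Shape, Circle, and Rectangle and the function area are hypothetical names chosen for illustration), several methods of the same function can be defined, and Julia selects the method based on the types of the arguments:

```julia
abstract type Shape end

struct Circle <: Shape
    r::Float64
end

struct Rectangle <: Shape
    w::Float64
    h::Float64
end

# Two methods of the same function, selected by argument type.
area(c::Circle) = pi * c.r^2
area(r::Rectangle) = r.w * r.h

# A generic function written once for any Shape that has an area method.
describe(s::Shape) = "$(nameof(typeof(s))) with area $(round(area(s), digits=2))"

for s in [Circle(1.0), Rectangle(2.0, 3.0)]
    println(describe(s))
end
```

A user of a package that exports Shape and area could add a new subtype with its own area method, and describe() would work for it without any change to the package.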


Modules: Modules in Julia are different workspaces that introduce a new global scope. They 
are delimited within module Name ... end, and they allow for the creation of top-level 
definitions (i.e. global variables) without worrying about naming conflicts when used together 
with other code. Within a module, you can control which names from other modules are 
visible via the import keyword, and which names are intended to be public via the export 
keyword. 
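
The following sketch shows a small module. The module name Geometry and its contents are hypothetical and serve only to illustrate export and qualified access.

```julia
module Geometry

export circleArea          # names intended to be public

const scale = 1.0          # a module-level global that is not exported

circleArea(r) = scale * pi * r^2

end # module Geometry

# Bring the exported names of the locally defined module into scope.
using .Geometry

println(circleArea(1.0))   # exported, so accessible directly
println(Geometry.scale)    # not exported, so accessed with a qualified name
```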


Parallel processing: Julia supports a variety of parallel computing constructs, including tasks
(also known as coroutines or green threads) and communication channels between them. A
basic macro is @async which, when used via, for example, @async myFunction(), would
schedule myFunction() to run in its own task.
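
A minimal sketch of @async combined with a communication channel is given below. The variable names are hypothetical.

```julia
# A buffered channel that can hold up to 3 integers.
ch = Channel{Int}(3)

# Start a producer that runs concurrently with the main code.
producer = @async begin
    for i in 1:3
        put!(ch, i^2)    # send a value through the channel
    end
    close(ch)            # signal that no more values will arrive
end

# Consume values until the channel is closed.
results = collect(ch)
wait(producer)

println(results)
```

Here the main code blocks inside collect() while waiting for values, which gives the producer the opportunity to run.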


Profiling: There are a variety of profiling tools that help to improve performance. One key
aspect of performance that we have mostly ignored is type stability. See for example the
@code_warntype macro, which helps to check for type stability.


Rational numbers: Julia supports rational numbers, along with arbitrary precision arithmetic. For
example, the rational number 2/3 is defined in Julia via 2//3. Arithmetic with rational
numbers is supported.
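
A brief sketch of rational and arbitrary precision arithmetic, using only functions from Julia Base:

```julia
a = 2//3
b = 1//6

println(a + b)                 # exact rational addition
println(a * b)                 # exact rational multiplication
println(numerator(a + b), " / ", denominator(a + b))
println(float(a))              # convert to a floating point approximation

# Arbitrary precision integers avoid overflow of the native Int type.
println(big(2)^100)
```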


Regular expressions: Julia supports regular expressions for matching strings. For example,
occursin(r"^\s*(#)", "# a comment") checks whether the string begins with optional whitespace
followed by # and returns true.
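
The sketch below shows two common uses of regular expressions: testing for a match with occursin() and extracting captured groups with match(). The function isComment and the sample strings are hypothetical.

```julia
# Test whether a line is a comment: optional whitespace followed by #.
isComment(line) = occursin(r"^\s*#", line)

println(isComment("   # a comment"))
println(isComment("x = 1  # trailing comment"))

# Extract captured groups from a match.
m = match(r"(\d+)-(\d+)", "pages 12-34")
firstPage = parse(Int, m[1])
lastPage = parse(Int, m[2])
println(lastPage - firstPage)
```

Note that match() returns nothing when there is no match, so code handling arbitrary input should check for that case before indexing into the result.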


Running external programs: Julia borrows backtick notation for commands from the shell, Perl,
and Ruby. However, the behavior of `Hello world` varies slightly from typical shell, Perl
or Ruby behavior. In particular, the backticks create a Cmd object, which can be connected to
other commands via pipes. In addition, Julia does not capture the output unless specifically
arranged for it. And finally, the command is never run with a shell, but rather Julia parses the
syntax directly, appropriately interpolating variables and splitting on words as the shell would,
respecting shell quoting syntax. The command is run as Julia’s immediate child process,
using fork and exec calls. As a simple example, consider: run(pipeline(`echo world`
& `echo hello`, `sort`));. This always outputs hello followed by world, each on its own line
(here both echos are parsed to a single UNIX pipe, and the other end of the pipe is read by the
sort command).
Strings: While some of our examples included string manipulation, we haven't delved into the
subject deeply. Julia supports a variety of string operations. For example, occursin("world",
"Hello, world") returns true.


Unicode and character encoding: Most of the examples in the book were restricted to ASCII
characters, however Julia fully supports Unicode. For example s = "\u2200 x \u2203 y"
yields the string ∀ x ∃ y.
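
The following sketch combines a few basic string operations with the Unicode example above. It uses only functions from Julia Base, and the variable names are hypothetical.

```julia
s = "\u2200 x \u2203 y"
println(s)
println(length(s))        # number of characters
println(ncodeunits(s))    # number of bytes in the UTF-8 encoding

greeting = "Hello, world"
println(occursin("world", greeting))
println(uppercase(greeting))
println(replace(greeting, "world" => "Julia"))
println(split(greeting, ", "))
```

The difference between length() and ncodeunits() arises because the characters ∀ and ∃ each occupy three bytes in UTF-8, while the ASCII characters occupy one.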





User defined types: In addition to the basic types in the system (e.g. Float64), users and developers
can create their own types via the struct keyword. In our examples, we have not created
our own types, however many of the packages define new structs and in some examples of the
book, we have referred directly to the fields of these structs. An example is in Listing 8.3, where we
use F.Q to refer to the field “Q” in the structure F.
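
A minimal sketch of defining a struct and of accessing fields follows. The type Point2D is hypothetical. For the second part we assume, purely for illustration, a factorization object returned by qr() from the LinearAlgebra standard library; whether this matches the object F of the listing mentioned above is an assumption.

```julia
using LinearAlgebra

# A user defined (immutable) type with two fields.
struct Point2D
    x::Float64
    y::Float64
end

# A function defined for the new type, using field access via the dot syntax.
distanceFromOrigin(p::Point2D) = sqrt(p.x^2 + p.y^2)

p = Point2D(3.0, 4.0)
println(distanceFromOrigin(p))
println(fieldnames(Point2D))

# Field-style access on a struct defined by a package (here a QR factorization).
A = [1.0 2.0; 3.0 4.0]
F = qr(A)
println(Matrix(F.Q) * F.R ≈ A)
```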


Unit testing: As reusable code is developed it may also be helpful to create unit tests for verifying
the validity of the code. This allows the code to be retested automatically every time it is
modified or the system is upgraded. For this Julia supports unit testing via the @test macro,
the runtests() function and other objects.
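
As a minimal sketch, the Test standard library can be used as follows. The function mySum and the test set name are hypothetical.

```julia
using Test

# Hypothetical function under test.
mySum(x, y) = x + y

# Group related tests. A summary is printed, and an error is raised
# at the end of the test set if any test fails.
@testset "mySum" begin
    @test mySum(1, 2) == 3
    @test mySum(0.1, 0.2) ≈ 0.3      # approximate comparison for floats
    @test_throws MethodError mySum("a", 1)
end
```

In a package, such tests are conventionally placed in the file test/runtests.jl so that the package manager can run them automatically.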


Appendix C 


Additional Packages - DRAFT 


We have used a variety of packages in this book. These were listed in Section However 
there are many more. At the time of writing, there are just over 4,000 registered
packages in the Julia ecosystem. Many of these packages deal with numerical mathematics, scientific
computing, or some specific engineering or technical application. There are hundreds of
packages associated with statistics and/or data-science, and we now provide an outline of some of
the popular packages in this space that have not been used in our examples.


ARCH.jl is a package that allows for ARCH (Autoregressive Conditional Heteroskedasticity) modeling.
ARCH models are a class of models designed to capture a feature of financial returns
data known as volatility clustering, i.e., the fact that large (in absolute value) returns tend to
cluster together, such as during periods of financial turmoil, which then alternate with relatively
calmer periods. This package provides efficient routines for simulating, estimating, and
testing a variety of ARCH and GARCH models (with GARCH being Generalized ARCH).


AutoGrad.jl is an automatic differentiation package for Julia. It started as a port of the popular
Python autograd package and forms the foundation of the Knet Julia deep learning framework.
AutoGrad can differentiate regular Julia code that includes loops, conditionals, helper
functions, closures etc. by keeping track of the primitive operations and using this execution
trace to compute gradients. It uses reverse mode differentiation (a.k.a. back propagation) so
it can efficiently handle functions with large array inputs and scalar outputs. It can compute
gradients of gradients to handle higher order derivatives.


BayesNets.jl is a package that implements Bayesian Networks for Julia through the introduction of
the BayesNet type, which contains information on the directed acyclic graph, and a list of
conditional probability distributions (CPDs). Several different CPDs are available. It supports
random sampling, weighted sampling, and Gibbs sampling for assignments. It supports
inference methods for discrete Bayesian networks, parameter learning for an entire graph,
structure learning, and the calculation of the Bayesian score for a discrete valued BayesNet,
based purely on the structure and data. Visualization of network structures is also possible
via integration with the TikzGraphs.jl package.


Bootstrap.jl is a package for statistical bootstrapping. It has several different resampling
methods and also has functionality for confidence intervals.
Convex.jl is a Julia package for Disciplined Convex Programming (DCP) optimization problems. It
can solve linear programs, mixed-integer linear programs, and DCP-compliant convex programs
using a variety of solvers, including Mosek, Gurobi, ECOS, SCS, and GLPK, through
the MathOptInterface interface. It also supports optimization with complex variables and
coefficients.


CPLEX.jl is an unofficial interface to the IBM® ILOG® CPLEX® Optimization Studio. It
provides an interface to the low-level C API, as well as an implementation of the solver-independent
MathOptInterface.jl. You cannot use CPLEX.jl without having purchased and
installed a copy of CPLEX Optimization Studio from IBM. This package is available free
of charge and in no way replaces or alters any functionality of IBM's CPLEX Optimization
Studio product.


CUDAnative.jl is part of the JuliaGPU collection of packages, and provides support for compiling
and executing native Julia kernels on CUDA hardware.


DataFramesMeta.jl is a package that provides a series of metaprogramming tools for DataFrames.jl,
which improve performance and provide a more convenient syntax.


Distances.jl is a package for evaluating distances (metrics) between vectors. It also provides
optimized functions to compute column-wise and pairwise distances. This is often substantially
faster than a straightforward loop implementation.


FastGaussQuadrature.jl is a Julia package to compute n-point Gauss quadrature nodes
and weights to 16 digit accuracy in O(n) time. It includes several different algorithms,
including gausschebyshev(), gausslegendre(), gaussjacobi(), gaussradau(),
gausslobatto(), gausslaguerre(), and gausshermite().


ForwardDiff.jl is part of the JuliaDiff family, and is a package that implements methods to
take derivatives, gradients, Jacobians, Hessians, and higher-order derivatives of native Julia
functions (or objects) using forward mode automatic differentiation (AD).


Gadfly.jl is a plotting and visualization system written in Julia and largely based on ggplot2
for R. It supports a large number of common plot types and composition techniques, along
with interactive features, such as panning and zooming, which are powered by Snap.svg. It
renders publication quality graphics in a variety of formats including SVG, PNG, Postscript,
and PDF, and has tight integration with DataFrames.jl.


GLMNet.jl is a package that acts as a wrapper for Fortran code from glmnet. Also see Lasso.jl,
which is a pure Julia implementation of the glmnet coordinate descent algorithm that often
achieves better performance.


Gurobi.jl is a wrapper for the Gurobi solver (through its C interface). Gurobi is a commercial
optimization solver for a variety of mathematical programming problems, including linear
programming (LP), quadratic programming (QP), quadratically constrained programming
(QCP), mixed integer linear programming (MILP), mixed-integer quadratic programming
(MIQP), and mixed-integer quadratically constrained programming (MIQCP). It is highly
recommended that the Gurobi.jl package is used with higher level packages such as JuMP.jl
or MathOptInterface.jl.


Interpolations.jl is a package for fast, continuous interpolations of discrete datasets in Julia.







JuliaDB.jl is a package designed for working with large multi-dimensional datasets.
Using an efficient binary format, it allows data to be loaded and saved efficiently, and
quickly recalled later. It is versatile, and allows for fast indexing, filtering, and sorting operations,
along with performing regressions. It comes with built-in distributed parallelism, and
aims to tie together the most useful data manipulation libraries for a comfortable experience.





JuliaDBMeta.jl is a set of macros that aim to simplify data manipulation with JuliaDB.jl.


JuMP.jl is a domain-specific modeling language for mathematical optimization embedded in Julia.
It supports a number of open-source and commercial solvers (Artelys Knitro, BARON,
Bonmin, Cbc, Clp, Couenne, CPLEX, ECOS, FICO Xpress, GLPK, Gurobi, Ipopt, MOSEK,
NLopt, SCS) for a variety of problem classes, including linear programming, (mixed) integer
programming, second-order conic programming, semi-definite programming, and non-linear
programming (convex and non-convex). JuMP makes it easy to specify and solve optimization
problems without expert knowledge, yet at the same time allows experts to implement
advanced algorithmic techniques such as exploiting efficient hot-starts in linear programming
or using callbacks to interact with branch-and-bound solvers. It is part of the JuliaOpt
collection of packages.


Loess.jl is a pure Julia implementation of local polynomial regression (i.e. locally estimated
scatterplot smoothing, known as LOESS).


LsqFit.jl is a package providing a small library of basic least-squares fitting in pure Julia.
The basic functionality was originally in Optim.jl, before being separated. At this time,
LsqFit.jl only utilizes the Levenberg-Marquardt algorithm for non-linear fitting.


Mamba.jl provides a pure Julia interface to implement and apply Markov chain Monte Carlo
(MCMC) methods for Bayesian analysis. It provides a framework for the specification of
hierarchical models, allows for block-updating of parameters, with samplers either defined by
the user, or available from other packages, and allows for the execution of sampling schemes,
and for posterior inference. It is intended to give users access to all levels of the design and
implementation of MCMC simulators to particularly aid in the development of new methods.
Several software options are available for MCMC sampling of Bayesian models. Individuals
who are primarily interested in data analysis, unconcerned with the details of MCMC, and have
models that can be fit in JAGS, Stan, or OpenBUGS are encouraged to use those programs.
Mamba is intended for individuals who wish to have access to lower-level MCMC tools, are
knowledgeable about MCMC methodologies, and have experience, or wish to gain experience,
with their application. The package also provides stand-alone convergence diagnostics and
posterior inference tools, which are essential for the analysis of MCMC output regardless of
the software used to generate it.


MLBase.jl aims to provide a collection of useful tools to support machine learning programs,
including: data manipulation and preprocessing, score-based classification, performance evaluation
(e.g. evaluating ROC), cross validation, and model tuning (i.e. searching for the best
settings of parameters).


MLJ.jl stands for ‘Machine Learning Julia’ and is a complete machine learning framework that
encapsulates many other packages.


MXNet.jl is now part of the Apache MXNet project. It brings flexible and efficient GPU computing
and state-of-the-art deep learning to Julia. Some of its features include efficient tensor/matrix
computation across multiple devices, including multiple CPUs, GPUs and distributed server
nodes, and flexible symbolic manipulation for the composition and construction of state-of-the-art
deep learning models.


NLopt.jl provides a Julia interface to the open-source NLopt library for non-linear optimization.
NLopt provides a common interface for many different optimization algorithms, including local
and global optimization, algorithms that use function values only (no derivative) and
those that exploit user-supplied gradients, as well as algorithms for unconstrained optimization,
bound-constrained optimization, and general non-linear inequality/equality constraints.
It can be used interchangeably with other optimization packages such as those from JuMP.


OnlineStats.jl is a package which provides on-line algorithms for statistics, models, and data
visualization. On-line algorithms are well suited for streaming data or when data is too large
to hold in memory. Observations are processed one at a time and all algorithms use O(1)
memory.


Optim.jl is a package that is part of the JuliaNLSolvers family, and provides support for univariate
and multivariate optimization through various kinds of optimization functions. Since
Optim.jl is written in Julia, it has several advantages: it removes the need for dependencies
that other non-Julia solvers may need, reduces the assumptions the user must make, and
allows for user controlled choices through Julia’s multiple dispatch rather than relying on predefined
choices made by the package developers. As it is written in Julia, it also has access to
the automatic differentiation features via packages in the JuliaDiff family.


Plotly.jl is a Julia interface to the plot.ly plotting library and cloud services, and can be used
as one of the plotting backends of the Plots.jl package.


POMDPs.jl is part of the JuliaPOMDP collection of packages and aims to provide an interface
for defining, solving, and simulating discrete and continuous, fully and partially observable
Markov decision processes. Note that POMDPs.jl only contains the interface for communicating
MDP and POMDP problem definitions. For a full list of supporting packages and
tools to be used along with POMDPs.jl, see JuliaPOMDP. These additional packages include
simulators, policies, several different MDP and POMDP solvers, along with other tools.


ProgressMeter.jl is a package that enables the use of a progress meter for long-running Julia
operations.


Reinforce.jl is an interface for reinforcement learning. It is intended to connect modular environments,
policies, and solvers with a simple interface. Two packages build on Reinforce.jl:
AtariAlgos.jl, which is an Arcade Learning Environment (ALE) wrapped as a Reinforce.jl
environment, and OpenAIGym.jl, which wraps the open source Python library gym, released
by OpenAI.


ReinforcementLearning.jl is a reinforcement learning package. It features many different
learning methods and has support for many different learning environments, including a wrapper
for the Atari ArcadeLearningEnvironment, and the OpenAI Gym environment, along with
others.


ScikitLearn.jl implements the popular scikit-learn interface and algorithms in Julia. It supports
both models from the Julia ecosystem and those of the scikit-learn library via PyCall.jl.
Its main features include approximately 150 Julia and Python models accessed through a uniform
interface, Pipelines and FeatureUnions, cross-validation, hyperparameter tuning, and
DataFrames support.


SimJulia.jl is a discrete-event, process-oriented simulation framework written in Julia. It is
inspired by the Python SimPy library.


StatsFuns.jl is a package that provides a collection of mathematical constants and numerical
functions for statistical computing, including various distribution related functions.


StatsKit.jl is a convenience meta-package which allows loading of essential packages for statistics
in one command. It currently loads the following statistics packages: Bootstrap,
CategoricalArrays, Clustering, CSV, DataFrames, Distances, Distributions,
GLM, HypothesisTests, KernelDensity, Loess, MultivariateStats, StatsBase, and TimeSeries.











Tables.jl combines the best of the DataStreams.jl and Queryverse.jl packages to provide
a set of fast and powerful interface functions for working with various kinds of table-like
data structures through predictable access patterns.


TensorFlow.jl acts as a wrapper around the popular TensorFlow machine learning framework
from Google. It enables both input data parsing and post-processing of results to be done
quickly via Julia’s JIT compilation. It also provides the ability to specify models using
native-looking Julia code, and through Julia metaprogramming, simplifies graph construction and
reduces code repetition.


TensorOperations.jl is a package that enables fast tensor operations using a convenient
Einstein index notation.


XGBoost.jl is a Julia interface to eXtreme Gradient Boosting, or XGBoost. It is an efficient
and scalable implementation of the gradient boosting framework. It includes an efficient linear model
solver and tree learning algorithms. The library is parallelized using OpenMP, and it can be
more than 10 times faster than some existing gradient boosting packages. It supports various
objective functions, including regression, classification and ranking. The package is also made
to be extensible, so that users can easily define their own objectives. It is part
of the Distributed (Deep) Machine Learning Community (dmlc).
Organizations


Much of the Julia package ecosystem on GitHub is grouped into organizations (or collections) of
packages, often based on specific domains of knowledge. Currently there are over 35 different Julia
organizations, and some of the more relevant ones for the statistician, data scientist, or machine
learning practitioner are listed below.


JuliaCloud is a collection of Julia packages for working with cloud services.


JuliaDiff is an informal organization which aims to unify and document packages written in Julia for
evaluating derivatives. The technical features of Julia, namely, multiple dispatch, source code
via reflection, JIT compilation, and first-class access to expression parsing make implementing
and using techniques from automatic differentiation easier than ever before. Packages hosted
under the JuliaDiff organization follow the same guidelines as for JuliaOpt; namely, they
should be actively maintained, well documented and have a basic testing suite.


JuliaData is a collection of Julia packages for data manipulation, storage, and I/O. 


JuliaDiffEq is an organization for unifying the packages for solving differential equations in Julia,
and includes packages such as DifferentialEquations.jl.


JuliaGeometry is a collection of packages that focus on computational geometry with Julia.


JuliaGPU contains a collection of Julia packages that support GPU computation.


JuliaGraphs is a collection of Julia packages for graph modeling and analysis.


JuliaImages is a collection of packages specifically focused on image processing, and has many
useful algorithms. Its main package is Images.jl.


JuliaInterop is a collection of packages that enable interoperability between Julia and various
other languages, such as C++, Matlab, and others.


JuliaMath contains a series of mathematics related packages.


JuliaML contains a series of Julia packages for Machine Learning.


JuliaOpt is a collection of optimization-related packages. Its purpose is to facilitate collaboration
among developers of a tightly integrated set of packages for mathematical optimization.


JuliaParallel is a collection of packages containing various models for parallel programming in
Julia.


JuliaPOMDP is a collection of POMDP packages for Julia. 


JuliaPlots is a collection of data visualization plotting packages for Julia. 
JuliaPy is a collection of packages that connect Julia and Python. 


JuliaStats is the main collection of statistics and Machine Learning packages.


JuliaTeX is a collection of packages for TeX typesetting and rendering in Julia.


JuliaText is a JuliaLang organization for Natural Language Processing, (textual) Information
Retrieval, and Computational Linguistics.


JunoLab is the landing page for the Juno IDE (integrated development environment). Juno is a free
environment for the Julia language, is built on the Atom editor, and is a powerful development
tool. The Juno.jl package defines Juno’s frontend API.